Abstract — The problem of maximum-likelihood (ML) estimation
of discrete tree-structured distributions is considered. Chow
and Liu established that ML-estimation reduces to the construc- 
tion of a maximum-weight spanning tree using the empirical 
mutual information quantities as the edge weights. Using the 
theory of large-deviations, we analyze the exponent associated 
with the error probability of the event that the ML-estimate of 
the Markov tree structure differs from the true tree structure, 
given a set of independently drawn samples. By exploiting the 
fact that the output of ML-estimation is a tree, we establish that 
the error exponent is equal to the exponential rate of decay of a 
single dominant crossover event. We prove that in this dominant 
crossover event, a non-neighbor node pair replaces a true edge 
of the distribution that is along the path of edges in the true tree 
graph connecting the nodes in the non-neighbor pair. Using ideas 
from Euclidean information theory, we then analyze the scenario 
of ML-estimation in the very noisy learning regime and show 
that the error exponent can be approximated as a ratio, which 
is interpreted as the signal-to-noise ratio (SNR) for learning tree 
distributions. We show via numerical experiments that in this 
regime, our SNR approximation is accurate. 

Index Terms — Maximum-Likelihood distribution estimation, 
Markov structure, tree-structured distributions, error exponent, 
large-deviations principle, Euclidean information theory. 



I. Introduction 

The estimation of a distribution from samples is a classical 
and an important generic problem in machine learning and 
statistics and is challenging for high-dimensional multivariate 
distributions. In this respect, graphical models [1] provide a 
significant simplification of joint distribution as the distribution 
can be factorized according to a graph defined on the set of 
nodes. Many specialized algorithms [2]-[8] exist for exact and 
approximate learning of graphical models with sparse graphs. 

When the graph is a tree, the Chow-Liu algorithm [2] 
provides an efficient method for the maximum-likelihood 
(ML) estimation of the probability distribution from a set
of i.i.d. samples drawn from the distribution. By exploiting 
the Markov tree structure, this algorithm reduces the ML- 
estimation problem to solving a maximum-weight spanning 
tree (MWST) problem. In this case, it is known that the ML- 
estimator learns the distribution correctly asymptotically, and 
hence, is consistent [9]. 

We are interested in the rate of convergence of the ML- 
estimator for tree distributions as we increase the number 
of samples. Specifically, we study the rate of decay of the 
error probability or the error exponent of the ML-estimator 
in learning the tree structure of the unknown distribution. We 
address the following questions: Is there exponential decay of 
the probability of error in structure learning as the number 
of samples tends to infinity? If so, what is the exact error 
exponent, and how does it depend on the parameters of the 
distribution? Which edges of the true tree are most-likely to 
be in error; in other words, what is the nature of the most- 
likely error in the ML-estimator? We provide concrete and 
intuitive answers to the above questions, thereby providing 
insights into how the parameters of the distribution influence 
the error exponent associated with learning the structure of 
discrete tree distributions. 

A. Main Contributions 

There are three main contributions in this paper. First, using 
the large-deviation principle (LDP) [10] we prove that the 
most-likely error in ML-estimation is a tree which differs from 
the true tree by a single edge. Second, again using the LDP, 
we derive the exact error exponent for ML-estimation of tree 
structures. Third, we provide a succinct and intuitive closed- 
form approximation for the error exponent which is tight in the 
very noisy learning regime, where the individual samples are 
not too informative about the tree structure. The approximate 
error exponent has a very intuitive explanation as the signal- 
to-noise ratio (SNR) for learning. 

We analyze the error exponent (also called the inaccuracy 
rate) for the estimation of the structure of the unknown tree 
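distribution. For the error event that the structure of the
ML-estimator $\mathcal{E}_{\mathrm{ML}}$ given $n$ samples differs from the true tree
structure $\mathcal{E}_P$ of the unknown distribution $P$, the error exponent
is given by

\[ K_P := \lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}\left(\{\mathcal{E}_{\mathrm{ML}} \neq \mathcal{E}_P\}\right). \tag{1} \]
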
To the best of our knowledge, error-exponent analysis for
tree-structure learning has not been considered before (see
Section I-B for a brief survey of the existing literature on
learning graphical models from data).

Finding the error exponent $K_P$ in (1) is highly non-trivial.
In general, one has to consider the computationally infeasible
combinatorial problem of finding the dominant error event
with the slowest rate of decay among all possible error
events [10, Ch. 1]. For learning the structure of trees, there
are a total of $d^{d-2} - 1$ possible error events$^1$, where $d$ is the
dimension (number of variables) of the unknown tree distribution
$P$. This rules out brute-force approaches for finding the
error exponent in (1), especially for high-dimensional data.
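
To give a sense of scale, the count of candidate error events can be tabulated directly from Cayley's formula. The short Python sketch below is only an illustration; the function name `num_error_events` is our own and does not appear in the paper.

```python
def num_error_events(d: int) -> int:
    """Number of spanning trees on d labeled nodes (Cayley's formula), minus the true tree."""
    if d < 2:
        raise ValueError("need at least two nodes")
    return d ** (d - 2) - 1


# The count grows super-exponentially in the number of variables d.
for d in (3, 5, 10, 20):
    print(d, num_error_events(d))
```

Even for moderate d, enumerating every wrong tree is out of reach, which is why a polynomial-size search space matters.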

In contrast, we establish that the search for the dominant 
error event for learning the structure of the tree can be limited 
to a polynomial-time search space (in d). Furthermore, we 
establish that this dominant error event of the ML-estimator is 
given by a tree which differs from the true tree by only a single 
edge. We provide a polynomial-time algorithm with $O(\zeta(T_P)\, d^2)$
complexity to find the error exponent in (1), where $\zeta(T_P)$ is
the diameter of the tree $T_P$. We heavily exploit the mechanism
of the ML Chow-Liu algorithm [2] for tree learning to
establish these results, and specifically, the fact that the ML- 
estimator tree distribution depends only on the relative order 
of the empirical mutual information quantities between all the 
node pairs (and not their absolute values). 

Although we provide a computationally-efficient way to 
compute the error exponent in (1), it is not available in closed
form. In Section VI, we use Euclidean information theory [12],
[13] to obtain an approximate error exponent in closed-form, 
which can be interpreted as the signal-to-noise ratio (SNR) 
for tree structure learning. Numerical simulations on various 
discrete graphical models verify that the approximation is tight 
in the very noisy regime. 

In Section VII, we extend our results to the case when the
true distribution P is not a tree. In this case, given samples 
drawn independently from P, we intend to learn the optimal 
projection P* onto the set of trees. Importantly, if P is not a 
tree, there may be several trees that are optimal projections [9] 
and this requires careful consideration of the error events. We 
derive the error exponent even in this scenario. 

B. Related Work 

The seminal work by Chow and Liu in [2] focused on 
learning tree models from data samples. The authors showed 
that the learning of the optimal tree distribution essentially 
decouples into two distinct steps: (i) a structure learning step 
and (ii) a parameter learning step. The structure learning step, 
which is the focus of this paper, can be performed efficiently
using a max-weight spanning tree algorithm with the empir- 
ical mutual information quantities as the edge weights. The 
parameter learning step is a maximum-likelihood estimation 
procedure where the parameters of the learned model are equal 
to those of the empirical distribution. Chow and Wagner [9], 

1 Since the ML output $\mathcal{E}_{\mathrm{ML}}$ and the true structure $\mathcal{E}_P$ are both spanning
trees over $d$ nodes and since there are $d^{d-2}$ possible spanning trees [11], we
have $d^{d-2} - 1$ possible error events.



in a follow-up paper, studied the consistency properties of the 
Chow-Liu algorithm for learning trees. They concluded that if 
the true distribution is Markov on a unique tree structure, then 
the Chow-Liu learning algorithm is asymptotically consistent. 
This implies that as the number of samples tends to infinity, the 
probability that the learned structure differs from the (unique) 
true structure tends to zero. 

Unfortunately, it is known that the exact learning of general 
graphical models is NP-hard [14], but there have been several 
works to learn approximate models. For example, Chechetka 
and Guestrin [3] developed good approximations for learning 
thin junction trees [15] (junction trees where the sizes of the 
maximal cliques are small). Heckerman [16] proposed learning 
the structure of Bayesian networks by using the Bayesian 
Information Criterion [17] (BIC) to penalize more complex 
models and by putting priors on various structures. Other 
authors used the maximum entropy principle or (sparsity-enforcing)
$\ell_1$ regularization as approximate graphical model
learning techniques. In particular, Dudik et al. [8] and Lee et 
al. [5] provide strong consistency guarantees on the learned 
distribution in terms of the log-likelihood of the samples. 
Johnson et al. [6] also used a similar technique known as 
maximum entropy relaxation (MER) to learn discrete and 
Gaussian graphical models. Wainwright et al. [4] proposed 
a regularization method for learning the graph structure based 
on $\ell_1$ logistic regression and provided strong theoretical guarantees
for learning the correct structure as the number of
samples, the number of variables, and the neighborhood size 
grow. In a similar work, Meinshausen and Buehlmann [7] 
considered learning the structure of arbitrary Gaussian models 
using the Lasso [18]. They show that the error probability 
of learning the wrong structure, under some mild technical 
conditions on the neighborhood size, decays exponentially 
even when the size of the graph d grows with the number 
of samples n. However, the rate of decay is not provided 
explicitly. Zuk et al. [19] provided bounds on the limit inferior 
and limit superior of the error rate for learning the structure of 
Bayesian networks but, in contrast to our work, these bounds 
are not asymptotically tight. In addition, the work in Zuk et 
al. [19] is intimately tied to the BIC [17], whereas our analysis 
is for the Chow-Liu ML tree learning algorithm [2]. 

There have also been a series of papers [20]-[23] that 
quantify the deviation of the empirical information-theoretic 
quantities from their true values by employing techniques from 
large-deviations theory. Some ideas from these papers will turn 
out to be important in the subsequent development because we 
exploit conditions under which the empirical mutual informa- 
tion quantities do not differ "too much" from the true values. 
This will ensure that structure learning succeeds with high 
probability. 

C. Paper Outline 

This paper is organized as follows: In Sections II and III,
we state the system model and the problem statement and
provide the necessary preliminaries on undirected graphical
models and the Chow-Liu algorithm [2] for learning tree
distributions. In Section IV, we derive an analytical expression
for the crossover rate of two node pairs. We then relate the
crossover rates to the overall error exponent in Section V. In
Section VI, we leverage ideas from Euclidean information
theory to state sufficient conditions that allow approximations
of the crossover rate and the error exponent. We obtain an
intuitively appealing closed-form expression. By redefining
the error event, we extend our results to the case when the
true distribution is not a tree in Section VII. We compare the
true and approximate crossover rates by performing numerical
experiments for a given graphical model in Section VIII.
Perspectives and extensions are discussed in Section IX.



II. System Model and Problem Statement 

A. Graphical Models 

An undirected graphical model [1] is a probability distribu- 
tion that factorizes according to the structure of an underlying 
undirected graph. More explicitly, a vector of random variables
$\mathbf{x} := [x_1, \ldots, x_d]^T$ is said to be Markov on a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$
with vertex set $\mathcal{V} = \{1, \ldots, d\}$ and edge set $\mathcal{E} \subset \binom{\mathcal{V}}{2}$ if

\[ P(x_i \mid x_{\mathcal{V} \setminus \{i\}}) = P(x_i \mid x_{\mathrm{nbd}(i)}), \quad \forall\, i \in \mathcal{V}, \tag{2} \]

where $\mathrm{nbd}(i)$ is the set of neighbors of $i$ in $\mathcal{G}$, i.e., $\mathrm{nbd}(i) :=
\{j \in \mathcal{V} : (i,j) \in \mathcal{E}\}$. Eq. (2) is called the (local) Markov
property and states that if random variable $x_i$ is conditioned
on its neighboring random variables, then $x_i$ is independent
of the rest of the variables in the graph.

In this paper, we assume that each random variable $x_i \in \mathcal{X}$,
and we also assume that $\mathcal{X}$ is a known finite set$^2$. Hence,
the joint distribution $P \in \mathcal{P}(\mathcal{X}^d)$, the set of all distributions
supported on $\mathcal{X}^d$.

We limit our analysis in this paper to the set of strictly
positive$^3$ graphical models $P$, in which the graph of $P$ is a
tree on the $d$ nodes, denoted $T_P = (\mathcal{V}, \mathcal{E}_P)$. Thus, $T_P$ is an
undirected, acyclic and connected graph with vertex set $\mathcal{V} =
\{1, \ldots, d\}$ and edge set $\mathcal{E}_P$, with $d - 1$ edges. Let $\mathcal{T}^d$ be the
set of spanning trees on $d$ nodes, and hence, $T_P \in \mathcal{T}^d$. Tree
distributions possess the following factorization property [1]:

\[ P(\mathbf{x}) = \prod_{i \in \mathcal{V}} P_i(x_i) \prod_{(i,j) \in \mathcal{E}_P} \frac{P_{i,j}(x_i, x_j)}{P_i(x_i)\, P_j(x_j)}, \tag{3} \]

where $P_i$ and $P_{i,j}$ are the marginals on node $i \in \mathcal{V}$ and edge
$(i,j) \in \mathcal{E}_P$ respectively. Since $T_P$ is spanning, $P_{i,j} \neq P_i P_j$
for all $(i,j) \in \mathcal{E}_P$. Hence, there is a substantial simplification
of the joint distribution which arises from the Markov tree
dependence. In particular, the distribution is completely specified
by the set of edges $\mathcal{E}_P$ and pairwise marginals $P_{i,j}$ on
the edges of the tree $(i,j) \in \mathcal{E}_P$. In Section VII, we extend
our analysis to general distributions which are not necessarily
Markov on a tree.
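
The tree factorization above can be checked numerically on a toy model. The following Python sketch is illustrative only: it assumes a hypothetical three-node binary chain with 0-indexed nodes, builds its joint distribution from transition matrices of our own choosing, and then rebuilds the same joint from node and edge marginals. The helper name `tree_joint` is not from the paper.

```python
import itertools
import numpy as np


def tree_joint(node_marg, edge_marg, edges, d, k):
    """Evaluate the tree factorization on every configuration of d variables over {0..k-1}."""
    P = np.zeros((k,) * d)
    for x in itertools.product(range(k), repeat=d):
        # Product of node marginals ...
        p = np.prod([node_marg[i][x[i]] for i in range(d)])
        # ... times one ratio P_ij / (P_i P_j) per tree edge.
        for (i, j) in edges:
            p *= edge_marg[(i, j)][x[i], x[j]] / (node_marg[i][x[i]] * node_marg[j][x[j]])
        P[x] = p
    return P


# Hypothetical strictly positive chain 0 - 1 - 2 (values chosen for illustration).
p0 = np.array([0.6, 0.4])
A = np.array([[0.8, 0.2], [0.3, 0.7]])  # P(x1 | x0)
B = np.array([[0.9, 0.1], [0.4, 0.6]])  # P(x2 | x1)
joint = p0[:, None, None] * A[:, :, None] * B[None, :, :]

node_marg = [joint.sum(axis=(1, 2)), joint.sum(axis=(0, 2)), joint.sum(axis=(0, 1))]
edge_marg = {(0, 1): joint.sum(axis=2), (1, 2): joint.sum(axis=0)}

rebuilt = tree_joint(node_marg, edge_marg, [(0, 1), (1, 2)], d=3, k=2)
print(np.allclose(rebuilt, joint))
```

The check confirms that, for this chain, the edge set and the pairwise marginals on the edges determine the full joint distribution.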



2 We defer the analysis of learning the structure of jointly Gaussian variables
where $\mathcal{X} = \mathbb{R}$ to a companion paper.

3 A distribution $P$ is said to be strictly positive if $P(\mathbf{x}) > 0$ for all $\mathbf{x} \in \mathcal{X}^d$.



B. Problem Statement 

In this paper, we consider a learning problem, where
we are given a set of $n$ i.i.d. $d$-dimensional samples
$\mathbf{x}^n := \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ from an unknown distribution $P \in
\mathcal{P}(\mathcal{X}^d)$, which is Markov with respect to a tree $T_P \in \mathcal{T}^d$. Each
sample or observation $\mathbf{x}_k := [x_{k,1}, \ldots, x_{k,d}]^T$ is a vector of $d$
dimensions where each entry can only take on one of a finite
number of values in the alphabet $\mathcal{X}$.

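Given $\mathbf{x}^n$, the ML-estimator of the unknown distribution $P$
is defined as

\[ P_{\mathrm{ML}} := \operatorname*{argmax}_{Q \in \mathcal{D}(\mathcal{X}^d, \mathcal{T}^d)} \sum_{k=1}^{n} \log Q(\mathbf{x}_k), \tag{4} \]

where $\mathcal{D}(\mathcal{X}^d, \mathcal{T}^d) \subset \mathcal{P}(\mathcal{X}^d)$ is defined as the set of all tree
distributions on the alphabet $\mathcal{X}^d$ over $d$ nodes.

In 1968, Chow and Liu showed that the above ML-estimate
$P_{\mathrm{ML}}$ can be found efficiently via a MWST algorithm [2], which
is described in Section III. We denote the tree graph of the
ML-estimate $P_{\mathrm{ML}}$ by $T_{\mathrm{ML}} = (\mathcal{V}, \mathcal{E}_{\mathrm{ML}})$ with vertex set $\mathcal{V}$ and
edge set $\mathcal{E}_{\mathrm{ML}}$.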

Given a tree distribution $P$, define the error event that the
set of edges is not estimated correctly by the ML-estimator as

\[ \mathcal{A}_n := \{\mathcal{E}_{\mathrm{ML}} \neq \mathcal{E}_P\}. \tag{5} \]

We denote $\mathbb{P} := P^n$ as the $n$-fold product probability measure
of the $n$ samples $\mathbf{x}^n$ which are drawn i.i.d. from $P$. In this
paper, we are interested in studying the rate or error exponent
$K_P$ at which the probability of the above error event decays
exponentially with the number of samples $n$, given by

\[ K_P := \lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}(\mathcal{A}_n), \tag{6} \]

whenever the limit exists. Indeed, we will prove in the sequel
that the limit in (6) exists. With the $\doteq$ notation$^5$, (6) can be
written as

\[ \mathbb{P}(\mathcal{A}_n) \doteq \exp(-n K_P). \tag{7} \]


A positive error exponent (Kp > 0) implies an exponential 
decay of error probability in ML structure learning, and we 
will establish necessary and sufficient conditions to ensure this. 

Note that we are only interested in quantifying the probability
of the error in learning the structure of $P$ in (5). We
are not concerned about the parameters that define the ML 
tree distribution P ML . Since there are only finitely many (but 
a super-exponential number of) structures, this is in fact akin 
to an ML problem where the parameter space is discrete and 
finite [27]. Thus, under some mild technical conditions, we 
can expect exponential decay in the probability of error as 
mentioned in [27]. Otherwise, we can only expect convergence 
with rate $O(1/\sqrt{n})$ for estimation of parameters that belong to
a continuous parameter space [28]. In this work, we quantify 

4 In the maximum-likelihood estimation literature (e.g., [24], [25]), if the limit
in (6) exists, $K_P$ is also typically known as the inaccuracy rate. We will be
using the terms rate, error exponent and inaccuracy rate interchangeably in
the sequel. All these terms refer to $K_P$.

5 The $\doteq$ notation (used in [26]) denotes equality to the first order in the
exponent. For two real sequences $\{a_n\}$ and $\{b_n\}$, $a_n \doteq b_n$ if and only if
$\lim_{n \to \infty} \frac{1}{n} \log(a_n / b_n) = 0$.
the error exponent for learning tree structures using the ML
learning procedure precisely. 

III. Maximum-Likelihood Learning of Tree 
Distributions from Samples 

In this section, we review the classical Chow-Liu algorithm [2]
for learning the ML tree distribution $P_{\mathrm{ML}}$ given a set
of $n$ samples $\mathbf{x}^n$ drawn i.i.d. from a tree distribution $P$. Recall
the ML-estimation problem in (4), where $\mathcal{E}_{\mathrm{ML}}$ denotes the set of
edges of the tree $T_{\mathrm{ML}}$ on which $P_{\mathrm{ML}}$ is tree-dependent. Note that
since $P_{\mathrm{ML}}$ is tree-dependent, from (3), we have the result that
it is completely specified by the structure $\mathcal{E}_{\mathrm{ML}}$ and consistent
pairwise marginals $P_{\mathrm{ML}}(x_i, x_j)$ on its edges $(i,j) \in \mathcal{E}_{\mathrm{ML}}$.

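In order to obtain the ML-estimator, we need the notion of
a type or empirical distribution of $P$, given $\mathbf{x}^n$, defined as

\[ \hat{P}(\mathbf{x} \mid \mathbf{x}^n) := \frac{N(\mathbf{x} \mid \mathbf{x}^n)}{n} = \frac{1}{n} \sum_{k=1}^{n} \delta\{\mathbf{x}_k = \mathbf{x}\}, \tag{8} \]

where $N(\mathbf{x} \mid \mathbf{x}^n)$ is the number of times $\mathbf{x} \in \mathcal{X}^d$ occurred in
$\mathbf{x}^n$. For convenience, in the rest of the paper, we will denote
the empirical distribution by $\hat{P}(\mathbf{x})$ instead of $\hat{P}(\mathbf{x} \mid \mathbf{x}^n)$.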

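Fact 1: The ML-estimator in (4) is equivalent to the following
optimization problem:

\[ P_{\mathrm{ML}} = \operatorname*{argmin}_{Q \in \mathcal{D}(\mathcal{X}^d, \mathcal{T}^d)} D(\hat{P} \,\|\, Q), \tag{9} \]

where $\hat{P}$ is the empirical distribution of $\mathbf{x}^n$, given by (8).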
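Proof: By the definition of the KL-divergence, we have

\[ \begin{aligned} n D(\hat{P} \,\|\, Q) &= -n H(\hat{P}) - n \sum_{\mathbf{x} \in \mathcal{X}^d} \hat{P}(\mathbf{x}) \log Q(\mathbf{x}) \\ &= -n H(\hat{P}) - \sum_{k=1}^{n} \log Q(\mathbf{x}_k), \end{aligned} \tag{10} \]

where we use the fact that the empirical distribution $\hat{P}$ in (8)
assigns a probability mass of $1/n$ to each sample $\mathbf{x}_k$. Since
$H(\hat{P})$ does not depend on $Q$, minimizing the left-hand side over $Q$
is equivalent to maximizing the log-likelihood in (4). ■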
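
The identity used in this proof is easy to verify numerically. The Python sketch below is only a sanity check under our own assumptions: the samples are drawn uniformly from a two-variable binary alphabet, and the candidate distribution `Q` is an arbitrary strictly positive table (the identity itself does not require Q to be a tree distribution).

```python
from collections import Counter
import math
import random

random.seed(0)
alphabet = [(a, b) for a in (0, 1) for b in (0, 1)]
samples = [random.choice(alphabet) for _ in range(50)]  # hypothetical data
n = len(samples)

# Type (empirical distribution) of the samples: relative frequencies.
counts = Counter(samples)
P_hat = {x: c / n for x, c in counts.items()}

# An arbitrary strictly positive candidate distribution Q (illustrative values).
Q = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

entropy = -sum(p * math.log(p) for p in P_hat.values())
kl = sum(p * math.log(p / Q[x]) for x, p in P_hat.items())
log_lik = sum(math.log(Q[x]) for x in samples)

lhs = n * kl                    # n D(P_hat || Q)
rhs = -n * entropy - log_lik    # -n H(P_hat) - sum_k log Q(x_k)
print(abs(lhs - rhs) < 1e-9)
```

Because the entropy term is fixed by the data, ranking candidates Q by divergence from the type is the same as ranking them by log-likelihood.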
The minimization over the second variable in (9) is also
known as the reverse I-projection [29] of $\hat{P}$ onto the set
of tree distributions $\mathcal{D}(\mathcal{X}^d, \mathcal{T}^d)$. We now state the main
result of the Chow-Liu tree learning algorithm [2]. In this
paper, with a slight abuse of notation, we denote the mutual
information $I(x_i; x_j)$ between two random variables $x_i$ and
$x_j$ corresponding to nodes $i$ and $j$ as

\[ I(P_{i,j}) := \sum_{(x_i, x_j) \in \mathcal{X}^2} P_{i,j}(x_i, x_j) \log \frac{P_{i,j}(x_i, x_j)}{P_i(x_i)\, P_j(x_j)}. \tag{11} \]

If $e = (i,j)$, then we will also denote the mutual information
as $I(P_e) = I(P_{i,j})$.
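
As a concrete reading of this definition, the mutual information of a pairwise distribution can be computed from its joint table. The following Python sketch is illustrative: the function name `mutual_information` is ours, natural logarithms are assumed (so the result is in nats), and the usual convention 0 log 0 = 0 is applied.

```python
import numpy as np


def mutual_information(P_ij):
    """Mutual information (in nats) of a joint pmf given as a 2-D array."""
    P_ij = np.asarray(P_ij, dtype=float)
    P_i = P_ij.sum(axis=1, keepdims=True)   # marginal of the first variable, shape (k, 1)
    P_j = P_ij.sum(axis=0, keepdims=True)   # marginal of the second variable, shape (1, k)
    mask = P_ij > 0                         # skip zero-probability cells (0 log 0 = 0)
    product = P_i @ P_j                     # outer product of the marginals
    return float(np.sum(P_ij[mask] * np.log(P_ij[mask] / product[mask])))


# Independent variables carry no information; identical fair bits carry log 2 nats.
print(mutual_information(np.outer([0.3, 0.7], [0.5, 0.5])))
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))
```

Applying the same function to an empirical pairwise table yields the empirical mutual information used as an edge weight.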

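Theorem 1 (Chow-Liu Tree Learning [2]): The structure
and parameters of the ML-estimate $P_{\mathrm{ML}}$ in (4) are given by

\[ \mathcal{E}_{\mathrm{ML}} = \operatorname*{argmax}_{\mathcal{E}_Q \,:\, Q \in \mathcal{D}(\mathcal{X}^d, \mathcal{T}^d)} \sum_{e \in \mathcal{E}_Q} I(\hat{P}_e), \tag{12} \]

\[ P_{\mathrm{ML}}(x_i, x_j) = \hat{P}_{i,j}(x_i, x_j), \quad \forall\, (i,j) \in \mathcal{E}_{\mathrm{ML}}, \tag{13} \]

where $\hat{P}$ is the empirical distribution in (8) given the data
$\mathbf{x}^n$, and $I(\hat{P}_e) = I(\hat{P}_{i,j})$ is the empirical mutual information
of random variables $x_i$ and $x_j$, which is a function of the
empirical distribution $\hat{P}_e$.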

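Proof: For a fixed tree distribution $Q \in \mathcal{D}(\mathcal{X}^d, \mathcal{T}^d)$, $Q$
admits the factorization in (3), and we have

\[ \begin{aligned}
D(\hat{P} \,\|\, Q) + H(\hat{P})
&= -\sum_{\mathbf{x} \in \mathcal{X}^d} \hat{P}(\mathbf{x}) \log Q(\mathbf{x}) \\
&= -\sum_{\mathbf{x} \in \mathcal{X}^d} \hat{P}(\mathbf{x}) \log \left[ \prod_{i \in \mathcal{V}} Q_i(x_i) \prod_{(i,j) \in \mathcal{E}_Q} \frac{Q_{i,j}(x_i, x_j)}{Q_i(x_i)\, Q_j(x_j)} \right] \\
&= -\sum_{i \in \mathcal{V}} \sum_{x_i \in \mathcal{X}} \hat{P}_i(x_i) \log Q_i(x_i)
 - \sum_{(i,j) \in \mathcal{E}_Q} \sum_{(x_i, x_j) \in \mathcal{X}^2} \hat{P}_{i,j}(x_i, x_j) \log \frac{Q_{i,j}(x_i, x_j)}{Q_i(x_i)\, Q_j(x_j)}.
\end{aligned} \tag{14} \]

For a fixed structure $\mathcal{E}_Q$, it can be shown [2] that the above
quantity is minimized when the pairwise marginals over the
edges of $\mathcal{E}_Q$ are set to those of $\hat{P}$, i.e., for all $Q \in \mathcal{D}(\mathcal{X}^d, \mathcal{T}^d)$,

\[ D(\hat{P} \,\|\, Q) + H(\hat{P}) \geq -\sum_{i \in \mathcal{V}} \sum_{x_i \in \mathcal{X}} \hat{P}_i(x_i) \log \hat{P}_i(x_i)
 - \sum_{(i,j) \in \mathcal{E}_Q} \sum_{(x_i, x_j) \in \mathcal{X}^2} \hat{P}_{i,j}(x_i, x_j) \log \frac{\hat{P}_{i,j}(x_i, x_j)}{\hat{P}_i(x_i)\, \hat{P}_j(x_j)} \tag{15} \]

\[ = \sum_{i \in \mathcal{V}} H(\hat{P}_i) - \sum_{(i,j) \in \mathcal{E}_Q} I(\hat{P}_{i,j}). \tag{16} \]

The first term in (16) is a constant with respect to $Q$.
Furthermore, since $\mathcal{E}_Q$ is the edge set of the tree distribution
$Q \in \mathcal{D}(\mathcal{X}^d, \mathcal{T}^d)$, the optimization for the ML tree distribution
$P_{\mathrm{ML}}$ reduces to the MWST search for the optimal edge set as
in (12). ■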

Hence, the optimal tree probability distribution $P_{\mathrm{ML}}$ is the
reverse I-projection of $\hat{P}$ onto the optimal tree structure given
by (12). Thus, the optimization problem in (9) essentially reduces
to a search for the structure of $P_{\mathrm{ML}}$. The structure of $P_{\mathrm{ML}}$
completely determines its distribution, since the parameters are
given by the empirical distribution in (13). To solve (12), we
use the samples $\mathbf{x}^n$ to compute the empirical distribution $\hat{P}$
using (8), then use $\hat{P}$ to compute $I(\hat{P}_e)$ for each node pair
$e \in \binom{\mathcal{V}}{2}$. Subsequently, we use the set of empirical mutual
information quantities $\{I(\hat{P}_e) : e \in \binom{\mathcal{V}}{2}\}$ as the edge weights
for the MWST problem$^6$.

We see that the Chow-Liu MWST algorithm
is an efficient way of solving the ML-estimation problem,
especially when the dimension $d$ is large. This is because
there are $d^{d-2}$ possible spanning trees over $d$ nodes [11],
ruling out the possibility of performing an exhaustive search
for the optimal tree structure. In contrast, the MWST can
be found, say using Kruskal's algorithm [30], [31] or Prim's
algorithm [32], in $O(d^2 \log d)$ time.
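
The structure-learning step reviewed in this section can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `empirical_mi` and `chow_liu_edges`, the 0-indexed node labels, and the synthetic binary Markov chain used as test data are all our own assumptions. Kruskal's algorithm with a simple union-find is used for the MWST.

```python
import itertools
import numpy as np


def empirical_mi(samples, i, j, k):
    """Empirical mutual information (nats) between columns i and j of an (n, d) integer array."""
    n = samples.shape[0]
    joint = np.zeros((k, k))
    np.add.at(joint, (samples[:, i], samples[:, j]), 1.0 / n)  # pairwise type
    pi = joint.sum(axis=1, keepdims=True)
    pj = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pi @ pj)[mask])))


def chow_liu_edges(samples, k):
    """Edge set of a maximum-weight spanning tree with empirical MI weights (Kruskal)."""
    d = samples.shape[1]
    weights = {(i, j): empirical_mi(samples, i, j, k)
               for i, j in itertools.combinations(range(d), 2)}
    parent = list(range(d))

    def find(u):
        # Union-find root lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    edges = set()
    for (i, j) in sorted(weights, key=weights.get, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:          # adding (i, j) does not create a cycle
            parent[ri] = rj
            edges.add((i, j))
    return edges


# Hypothetical data: a binary Markov chain 0 - 1 - 2 - 3 with flip probability 0.1.
rng = np.random.default_rng(0)
n, d = 2000, 4
x = np.zeros((n, d), dtype=int)
x[:, 0] = rng.integers(0, 2, size=n)
for t in range(1, d):
    flip = rng.random(n) < 0.1
    x[:, t] = np.where(flip, 1 - x[:, t - 1], x[:, t - 1])

edges = chow_liu_edges(x, k=2)
print(sorted(edges))
```

Note that only the ordering of the weights matters to Kruskal's algorithm, which is the property the error-exponent analysis exploits.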

IV. LDP for Empirical Mutual Information 

The goal of this paper is to characterize the error exponent
for ML tree learning $K_P$ in (6). As a first step, we consider a

6 If we use the true mutual information quantities as inputs to the MWST,
then the true edge set $\mathcal{E}_P$ is the output.
simpler event, which may potentially lead to an error in ML-estimation.
In this section, we derive the LDP rate for this
event, and in the next section, we use the result to derive $K_P$,
the exponent associated with the error event $\mathcal{A}_n$ defined in (5).

Since the ML-estimate uses the empirical mutual informa- 
tion quantities as the edge weights for the MWST algorithm, 
the relative values of the empirical mutual information quanti- 
ties have an impact on the accuracy of ML-estimation. In other 
words, if the order of these empirical quantities is different 
from the true order then it can potentially lead to an error 
in the estimated edge set. Hence, it is crucial to study the
probability of the event that the order of the empirical mutual
information quantities of any two node pairs differs from the true
order.

Formally, let us consider two distinct node pairs with no
common nodes $e, e' \in \binom{\mathcal{V}}{2}$ with unknown distribution $P_{e,e'} \in
\mathcal{P}(\mathcal{X}^4)$. Assume that the order of the true mutual information
quantities follows $I(P_e) > I(P_{e'})$. A crossover event$^7$ occurs
if the corresponding empirical mutual information quantities
are of the reverse order, given by

\[ \mathcal{C}_{e,e'} := \left\{ I(\hat{P}_e) \leq I(\hat{P}_{e'}) \right\}. \tag{17} \]


As the number of samples $n \to \infty$, the empirical quantities
approach the true ones, and hence, the probability of the above
event decays to zero. When the decay is exponential, we have an
LDP for the above event, and we term its rate the crossover
rate for empirical mutual information quantities, defined as

\[ J_{e,e'} := \lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}(\mathcal{C}_{e,e'}), \tag{18} \]

assuming the limit in (18) exists. Indeed, we show in the proof
of Theorem 2 that the limit exists. Intuitively (and as seen in
our numerical simulations in Section VIII), if the difference
between the true mutual information quantities $I(P_e) - I(P_{e'})$
is large (i.e., $I(P_e) \gg I(P_{e'})$), we expect the probability of the
crossover event $\mathcal{C}_{e,e'}$ to be small. Thus, the rate of decay would
be faster and hence, we expect the crossover rate $J_{e,e'}$ to be
large. In the following, we see that $J_{e,e'}$ depends not only on
the difference of mutual information quantities $I(P_e) - I(P_{e'})$,
but also on the distribution $P_{e,e'}$ of the variables on node pairs
$e$ and $e'$, since the distribution $P_{e,e'}$ influences the accuracy
of estimating them.
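
The decay of the crossover probability with the sample size can be observed in a small Monte Carlo experiment. The Python sketch below is an illustration under our own assumptions, not an experiment from the paper: the four binary variables, the flip probability 0.3, the trial counts, and the function names are hypothetical. Node pair e = (0, 1) is dependent, node pair e' = (2, 3) is independent, so the true mutual information of e exceeds that of e'.

```python
import numpy as np


def binary_mi(a, b):
    """Empirical mutual information (nats) of two binary sample vectors."""
    n = a.size
    joint = np.bincount(2 * a + b, minlength=4).reshape(2, 2) / n
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pa @ pb)[mask])))


def crossover_frequency(n, trials, rng, flip=0.3):
    """Fraction of trials in which the empirical MI of e is at most that of e'."""
    count = 0
    for _ in range(trials):
        x0 = rng.integers(0, 2, size=n)
        x1 = np.where(rng.random(n) < flip, 1 - x0, x0)  # noisy copy of x0
        x2 = rng.integers(0, 2, size=n)                  # independent pair
        x3 = rng.integers(0, 2, size=n)
        if binary_mi(x0, x1) <= binary_mi(x2, x3):
            count += 1
    return count / trials


rng = np.random.default_rng(1)
for n in (10, 50, 200):
    print(n, crossover_frequency(n, trials=2000, rng=rng))
```

Plotting the negative logarithm of these frequencies divided by n against n would give a crude empirical view of the crossover rate, although rare events at large n need many more trials than used here.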

Theorem 2 (Crossover Rate for Empirical MIs): The
crossover rate for a pair of empirical mutual information
quantities in (18) is given by

\[ J_{e,e'} = \inf_{Q \in \mathcal{P}(\mathcal{X}^4)} \left\{ D(Q \,\|\, P_{e,e'}) : I(Q_{e'}) = I(Q_e) \right\}, \tag{19} \]

where $Q_e, Q_{e'} \in \mathcal{P}(\mathcal{X}^2)$ are marginals of $Q$ over node pairs
$e$ and $e'$, which do not share common nodes, i.e.,

\[ Q_e := \sum_{x_{e'}} Q, \qquad Q_{e'} := \sum_{x_e} Q. \tag{20} \]

The infimum in (19) is attained by some distribution $Q^*_{e,e'} \in
\mathcal{P}(\mathcal{X}^4)$ satisfying $I(Q^*_{e'}) = I(Q^*_e)$ and $J_{e,e'} > 0$.

7 The event $\mathcal{C}_{e,e'}$ in (17) depends on the number of samples $n$, but we
suppress this dependence for convenience.



Proof: (Sketch) The proof hinges on Sanov's theorem [26,
Ch. 11] and the contraction principle in large-deviations [10,
Sec. III.5]. The existence of the minimizer follows from the
compactness of the constraint set and Weierstrass' extreme
value theorem [33, Theorem 4.16]. The rate $J_{e,e'}$ is strictly
positive since we assumed, a priori, that the two node pairs $e$
and $e'$ satisfy $I(P_e) > I(P_{e'})$. As a result, $Q^*_{e,e'} \neq P_{e,e'}$ and
$D(Q^*_{e,e'} \,\|\, P_{e,e'}) > 0$. See Appendix A for the details. ■

In the above theorem, which is analogous to Theorem 3.3
in [22], we derived the crossover rate $J_{e,e'}$ as a constrained
minimization over a submanifold of distributions in $\mathcal{P}(\mathcal{X}^4)$
(see Fig. 4), and also proved the existence of an optimizing
distribution $Q^*$. However, it is not easy to further simplify the
rate expression in (19) since the optimization is non-convex.

Importantly, this means that it is not clear how the parameters
of the distribution $P_{e,e'}$ affect the rate $J_{e,e'}$; hence, (19)
offers little intuition about the relative ease or
difficulty of estimating particular tree-structured distributions.
In Section VI, we assume that $P$ satisfies some (so-called
very noisy learning) conditions and use Euclidean information
theory [12], [13] to approximate the rate in (19) in order to
gain insights into how the distribution parameters affect the
crossover rate $J_{e,e'}$ and, ultimately, the error exponent $K_P$ for
learning the tree structure.

Remark 1: Theorem 2 specifies the crossover rate $J_{e,e'}$
when the two node pairs $e$ and $e'$ do not have any common
nodes. If $e$ and $e'$ share one node, then the distribution
$P_{e,e'} \in \mathcal{P}(\mathcal{X}^3)$ and here, the crossover rate for empirical
mutual information is

\[ J_{e,e'} = \inf_{Q \in \mathcal{P}(\mathcal{X}^3)} \left\{ D(Q \,\|\, P_{e,e'}) : I(Q_{e'}) = I(Q_e) \right\}. \tag{21} \]

The results in the sequel do not depend on whether $e$ and $e'$
share a common node.

A. Example: Symmetric Star Graph 

It is now instructive to study a simple example to see how
the overall error exponent $K_P$ for structure learning in (6)
depends on the set of crossover rates $\{J_{e,e'} : e, e' \in \binom{\mathcal{V}}{2}\}$. We
consider a graphical model $P$ with an associated tree $T_P =
(\mathcal{V}, \mathcal{E}_P)$ which is a $d$-order star with a central node 1 and outer
nodes $2, \ldots, d$, as shown in Fig. 1. The edge set is given by
$\mathcal{E}_P = \{(1, i) : i = 2, \ldots, d\}$.

We assign the joint distributions $Q_a, Q_b \in \mathcal{P}(\mathcal{X}^2)$ and
$Q_{a,b} \in \mathcal{P}(\mathcal{X}^4)$ to the variables in this graph in the following
specific way:

1) $P_{1,i} = Q_a$ for all $2 \leq i \leq d$.

2) $P_{i,j} = Q_b$ for all $2 \leq i, j \leq d$, $i \neq j$.

3) $P_{1,i,j,k} = Q_{a,b}$ for all $2 \leq i, j, k \leq d$, $i \neq j \neq k$.

Thus, we have identical pairwise distributions $P_{1,i} = Q_a$ of
the central node 1 and any other node $i$, and also identical
pairwise distributions $P_{i,j} = Q_b$ of any two distinct outer
nodes $i$ and $j$. Furthermore, assume that $I(Q_a) > I(Q_b) > 0$.
Note that the distribution $Q_{a,b} \in \mathcal{P}(\mathcal{X}^4)$ completely specifies
the above graphical model with a star graph. Also, from the
above specifications, we see that $Q_a$ and $Q_b$ are the marginal
distributions of $Q_{a,b}$ with respect to node pairs $(1, i)$ and
$(j, k)$ respectively, i.e.,

\[ Q_a = \sum_{x_j, x_k} Q_{a,b}, \qquad Q_b = \sum_{x_1, x_i} Q_{a,b}. \tag{22} \]

Fig. 1. The star graph with $d = 9$. $Q_a$ is the joint distribution on any pair
of variables that form an edge, e.g., $x_1$ and $x_2$. $Q_b$ is the joint distribution on
any pair of variables that do not form an edge, e.g., $x_2$ and $x_3$. By symmetry,
there is only one crossover rate.




Proposition 3 (Error Exponent for Symmetric Star Graph):
For the symmetric graphical model with star graph and $Q_{a,b}$
as described above, the error exponent for structure learning
$K_P$ in (6) is equal to the crossover rate between an edge
and a non-neighbor node pair,

\[ K_P = J_{e,e'}, \quad \text{for any } e \in \mathcal{E}_P,\ e' \notin \mathcal{E}_P, \tag{23} \]

where, from (19), the crossover rate is given by

\[ J_{e,e'} = \inf_{R_{1,2,3,4} \in \mathcal{P}(\mathcal{X}^4)} \left\{ D(R_{1,2,3,4} \,\|\, Q_{a,b}) : I(R_{1,2}) = I(R_{3,4}) \right\}, \tag{24} \]

with $R_{1,2}$ and $R_{3,4}$ as the marginals of $R_{1,2,3,4}$, i.e.,
$R_{1,2} = \sum_{x_3, x_4} R_{1,2,3,4}$ and $R_{3,4} = \sum_{x_1, x_2} R_{1,2,3,4}$.




Proof: Since there are only two distinct distributions $Q_a$
(which corresponds to a true edge) and $Q_b$ (which corresponds
to a non-edge), there is only one unique rate $J_{e,e'}$, namely
the expression in (19) with $P_{e,e'}$ replaced by $Q_{a,b}$. If the
event $\mathcal{C}_{e,e'}$ in (17) occurs, an error definitely occurs. This
corresponds to the case where any one edge $e \in \mathcal{E}_P$ is replaced
by any other node pair $e'$ not in $\mathcal{E}_P$.$^8$ ■
Hence, we have derived the error exponent for learning a
symmetric star graph through the crossover rate $J_{e,e'}$ between
any node pair $e$ which is an edge in the star graph and
another node pair $e'$ which is not an edge. Note that each
such crossover event results in an error in the learned structure
since it leads to $e'$ being declared an edge instead of $e$. Due
to the symmetry, all such crossover rates between pairs $e$ and
$e'$ are equal. By the "worst-exponent-wins" rule [10, Ch. 1], it
is more likely to have a single crossover event than multiple
ones. Hence, the error exponent is equal to the crossover rate
between an edge and a non-neighbor pair in the symmetric
star graph.
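
The "worst-exponent-wins" rule invoked above can be illustrated numerically: the exponent of a finite sum of exponentially decaying terms converges to the smallest individual rate. The Python sketch below uses hypothetical rates chosen by us purely for illustration, and the function name `effective_exponent` is our own.

```python
import math


def effective_exponent(rates, n):
    """Compute -(1/n) log sum_i exp(-n r_i) in a numerically stable way."""
    r_min = min(rates)
    # Factor out exp(-n r_min) so the remaining sum stays in [1, len(rates)].
    s = sum(math.exp(-n * (r - r_min)) for r in rates)
    return r_min - math.log(s) / n


rates = [0.05, 0.08, 0.20]  # hypothetical exponents of three error events
for n in (10, 100, 1000, 10000):
    print(n, effective_exponent(rates, n))
```

As n grows, the printed values approach the minimum rate from below, since the terms with larger exponents become negligible relative to the slowest-decaying one.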

The symmetric star graph possesses symmetry in the distributions
and hence it is easy to relate $K_P$ to a sole crossover

8 Also see Theorem 5 and its proof for the argument that the dominant error
tree differs from the true tree by a single edge.



rate. In general, it is not straightforward to derive the error
exponent $K_P$ from the set of crossover rates $\{J_{e,e'}\}$ since
they may not all be equal and, more importantly, crossover
events for different node pairs affect the learned structure $\mathcal{E}_{\mathrm{ML}}$
in a complex manner. In the next section, we provide an exact
expression for $K_P$ by identifying the (sole) crossover event
related to a dominant error tree. Finally, we remark that the
crossover event $\mathcal{C}_{e,e'}$ is related to the notion of neighborhood
selection in the graphical model learning literature [4], [7].

V. Error Exponent for Structure Learning 

The analysis in the previous section characterized the rate
$J_{e,e'}$ for the crossover event $\mathcal{C}_{e,e'}$ between the empirical
mutual information quantities of two node pairs. In this section,
we connect this set of rates $\{J_{e,e'}\}$ to the quantity of interest, viz.,
the error exponent for ML-estimation of the edge set $K_P$ in (6).

Recall that the event $\mathcal{C}_{e,e'}$ denotes an error in estimating the
order of mutual information quantities. However, such events
$\mathcal{C}_{e,e'}$ need not necessarily lead to the error event $\mathcal{A}_n$ in (5)
that the ML-estimate of the edge set $\mathcal{E}_{\mathrm{ML}}$ is different from the
true set $\mathcal{E}_P$. This is because the ML-estimate $\mathcal{E}_{\mathrm{ML}}$ is a tree and
this global constraint implies that certain crossover events can
be ignored. In the sequel, we will identify useful crossover
events through the notion of a dominant error tree.
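
A toy example makes the distinction between harmless and harmful crossovers concrete. In the Python sketch below, the edge weights stand in for empirical mutual information values on a hypothetical four-node chain 0 - 1 - 2 - 3; all numbers and the function name `mwst` are our own illustrative choices.

```python
def mwst(weights, d):
    """Kruskal's algorithm for a maximum-weight spanning tree on nodes 0..d-1."""
    parent = list(range(d))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = set()
    for e in sorted(weights, key=weights.get, reverse=True):
        ru, rv = find(e[0]), find(e[1])
        if ru != rv:
            parent[ru] = rv
            tree.add(e)
    return tree


# Hypothetical "true" weights: the chain edges carry the largest values.
true_w = {(0, 1): 0.50, (1, 2): 0.40, (2, 3): 0.30,
          (0, 2): 0.20, (1, 3): 0.15, (0, 3): 0.05}
true_tree = mwst(true_w, 4)

# Crossover between two non-edges: the order changes, the tree does not.
harmless = dict(true_w)
harmless[(0, 2)], harmless[(1, 3)] = 0.15, 0.20

# Crossover between non-edge (0, 2) and edge (1, 2), which lies on the
# path joining nodes 0 and 2 in the true tree: the learned tree changes.
harmful = dict(true_w)
harmful[(0, 2)] = 0.45

print(mwst(harmless, 4) == true_tree)
print(sorted(mwst(harmful, 4)))
```

In the harmful case the non-edge replaces a true edge on the path between its end points, and the result differs from the true tree by exactly one edge.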

A. Dominant Error Tree 

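By using the fact that $\mathbb{P} = P^n$ is a measure (and hence is countably additive), we can decompose the error event for structure estimation $\mathcal{A}_n$ in (5) into a set of mutually exclusive events

$$\mathbb{P}(\mathcal{A}_n) = \mathbb{P}\Big( \bigcup_{T \in \mathcal{T}^d \setminus \{T_P\}} \mathcal{U}_n(T) \Big) = \sum_{T \in \mathcal{T}^d \setminus \{T_P\}} \mathbb{P}(\mathcal{U}_n(T)), \qquad (25)$$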

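where each $\mathcal{U}_n(T)$ denotes the event that the graph of the ML-estimate is a tree $T$ different from the true tree $T_P$. In other words,

$$\mathcal{U}_n(T) := \begin{cases} \{T_{ML} = T\}, & \text{if } T \in \mathcal{T}^d \setminus \{T_P\}, \\ \emptyset, & \text{if } T = T_P. \end{cases} \qquad (26)$$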



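Note that $\mathcal{U}_n(T) \cap \mathcal{U}_n(T') = \emptyset$ whenever $T \neq T'$. The large-deviation rate or the exponent for each error event $\mathcal{U}_n(T)$ is

$$\Upsilon(T) := \lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}(\mathcal{U}_n(T)), \qquad (27)$$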



whenever the limit exists. Among all the error events $\mathcal{U}_n(T)$, we identify the dominant one with the slowest rate of decay.

Definition 1 (Dominant Error Tree): A dominant error tree $T_P^* = (\mathcal{V}, \mathcal{E}_P^*)$ is a spanning tree given by$^{9}$

$$T_P^* := \operatorname*{argmin}_{T \in \mathcal{T}^d \setminus \{T_P\}} \Upsilon(T). \qquad (28)$$



Roughly speaking, a dominant error tree is the tree that is the most-likely asymptotic output of the ML-estimator in the event of an error. Hence, it belongs to the set $\mathcal{T}^d \setminus \{T_P\}$. In

9 We will use the notation argmin extensively in the sequel. It is to be understood that if there is no unique minimum (e.g., in (28)), then we arbitrarily choose one of the minimizing solutions.






Fig. 2. The path associated to the non-edge $e' = (u, v) \notin \mathcal{E}_P$, denoted $\mathrm{Path}(e'; \mathcal{E}_P) \subset \mathcal{E}_P$, is the set of edges along the unique path linking the end points of $e' = (u, v)$. The edge $r(e') = \operatorname{argmin}_{e \in \mathrm{Path}(e'; \mathcal{E}_P)} J_{e,e'}$ is the dominant replacement edge associated to $e' \notin \mathcal{E}_P$.



the following, we note that the error exponent in (6) is equal to the exponent of the dominant error tree.

Proposition 4 (Dominant Error Tree & Error Exponent): The error exponent $K_P$ for structure learning is equal to the exponent $\Upsilon(T_P^*)$ of the dominant error tree $T_P^*$, i.e.,

$$K_P = \Upsilon(T_P^*). \qquad (29)$$



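Proof: From (27), we can write

$$\mathbb{P}(\mathcal{U}_n(T)) \doteq \exp(-n \Upsilon(T)), \quad \forall\, T \in \mathcal{T}^d \setminus \{T_P\}. \qquad (30)$$

Now from (25), we have

$$\mathbb{P}(\mathcal{A}_n) \doteq \sum_{T \in \mathcal{T}^d \setminus \{T_P\}} \exp(-n \Upsilon(T)) \doteq \exp(-n \Upsilon(T_P^*)), \qquad (31)$$

from the "worst-exponent-wins" principle [10, Ch. 1] and the definition of the dominant error tree $T_P^*$ in (28). ■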
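The "worst-exponent-wins" step in (31) can be checked numerically. The following is a minimal sketch and not part of the original analysis: the three exponent values are hypothetical stand-ins for $\Upsilon(T)$, and `effective_exponent` is an illustrative helper name. It evaluates $-\frac{1}{n} \log \sum_T \exp(-n \Upsilon(T))$ and shows that the value approaches the smallest exponent as $n$ grows.

```python
import math

def effective_exponent(rates, n):
    """Return -(1/n) * log(sum_i exp(-n * rates[i])), computed stably."""
    m = min(rates)
    # Factor out the dominant term exp(-n * m) to avoid underflow for large n.
    tail = sum(math.exp(-n * (r - m)) for r in rates)
    return m - math.log(tail) / n

# Hypothetical exponents for three error trees (illustrative values only).
rates = [0.30, 0.12, 0.45]
for n in (10, 100, 1000, 10000):
    print(n, effective_exponent(rates, n))
```

For finite $n$ the computed value lies slightly below the minimum exponent, because the non-dominant trees still contribute to the sum; the gap vanishes as $n$ increases.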

Thus, by identifying a dominant error tree $T_P^*$, we can find the error exponent $K_P = \Upsilon(T_P^*)$. To this end, we revisit the crossover events $C_{e,e'}$ in (17), studied in the previous section. Consider a non-neighbor node pair $e'$ with respect to $\mathcal{E}_P$ and the unique path of edges in $\mathcal{E}_P$ connecting the two nodes, which we denote as $\mathrm{Path}(e'; \mathcal{E}_P)$. See Fig. 2, where we define the notion of the path given a non-edge $e'$. Note that $e'$ and $\mathrm{Path}(e'; \mathcal{E}_P)$ necessarily form a cycle; if we replace any edge $e \in \mathcal{E}_P$ along the path of the non-neighbor node pair $e'$, the resulting edge set $\mathcal{E}_P \setminus \{e\} \cup \{e'\}$ is still a spanning tree. Hence, all such replacements are feasible outputs of the ML-estimation in the event of an error. As a result, all such crossover events $C_{e,e'}$ need to be considered for the error event for structure learning $\mathcal{A}_n$ in (5). However, for the error exponent $K_P$, again by the "worst-exponent-wins" principle, we only need to consider the crossover event between each non-neighbor node pair $e'$ and its dominant replacement edge $r(e') \in \mathcal{E}_P$ defined below.

Definition 2 (Dominant Replacement Edge): For each non-neighbor node pair $e' \notin \mathcal{E}_P$, its dominant replacement edge $r(e') \in \mathcal{E}_P$ is defined as the edge in the unique path along $\mathcal{E}_P$ connecting the nodes in $e'$ having the minimum crossover rate

$$r(e') := \operatorname*{argmin}_{e \in \mathrm{Path}(e'; \mathcal{E}_P)} J_{e,e'}, \qquad (32)$$

where the crossover rate $J_{e,e'}$ is given by (19).

We are now ready to characterize the error exponent $K_P$ in terms of the crossover rate between non-neighbor node pairs and their dominant replacement edges.



Theorem 5 (Error exponent as a single crossover event): The error exponent for ML-tree estimation in (6) is given by

$$K_P = J_{r(e^*),e^*} = \min_{e' \notin \mathcal{E}_P} \; \min_{e \in \mathrm{Path}(e'; \mathcal{E}_P)} J_{e,e'}, \qquad (33)$$

where $r(e^*)$ is the dominant replacement edge, defined in (32), associated to $e^* \notin \mathcal{E}_P$ and $e^*$ is the optimizing non-neighbor node pair

$$e^* := \operatorname*{argmin}_{e' \notin \mathcal{E}_P} J_{r(e'),e'}. \qquad (34)$$

The dominant error tree $T_P^* = (\mathcal{V}, \mathcal{E}_P^*)$ in (28) has edge set $\mathcal{E}_P^* = \mathcal{E}_P \cup \{e^*\} \setminus \{r(e^*)\}$.

Proof: (Sketch) The edge set of the dominant error tree $\mathcal{E}_P^*$ differs from $\mathcal{E}_P$ in exactly one edge (see Appendix B). This is because if $\mathcal{E}_P^*$ were to differ from $\mathcal{E}_P$ in strictly more than one edge, the resulting error exponent would not be the minimum, hence contradicting Proposition 4. To identify the dominant error tree, we use the union bound as in (25) and the "worst-exponent-wins" principle [10, Ch. 1], to conclude that the rate that dominates is the minimum $J_{r(e'),e'}$ over all possible non-neighbor node pairs $e' \notin \mathcal{E}_P$. See Appendix B for the details. ■
The above theorem relates the set of crossover rates $\{J_{e,e'}\}$, which we characterized in the previous section, to the overall error exponent $K_P$, defined in (6). Note that the result in (33) is exponentially tight in the sense that $\mathbb{P}(\mathcal{A}_n) \doteq \exp(-n K_P)$, unlike the work in [19], where bounds on the limit inferior and the limit superior were established. We numerically compute the error exponent $K_P$ for different discrete distributions in Section VIII.

From (33), we see that if at least one of the crossover rates $J_{e,e'}$ in the minimization is zero, the overall error exponent $K_P$ is zero. This observation is important for the derivation of necessary and sufficient conditions for $K_P$ to be positive, and hence, for the error probability to decay exponentially in the number of samples $n$.
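The double minimization in (33) is straightforward to carry out once the crossover rates are available. The sketch below is illustrative only: `tree_path_edges` and `error_exponent` are hypothetical helper names, and `toy_rate` (a squared gap between made-up mutual information values) is merely a stand-in for the true $J_{e,e'}$ in (19), which requires solving a constrained optimization. The code enumerates every non-edge $e'$, walks its path in the true tree, and returns the minimizing pair together with the edge set of the dominant error tree.

```python
from itertools import combinations

def tree_path_edges(edges, u, v):
    """Edges on the unique u-v path in a tree given as a list of (a, b) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    parent = {u: None}
    stack = [u]
    while stack:                      # depth-first search rooted at u
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                stack.append(nxt)
    path = []
    while v != u:                     # walk back from v to u
        path.append(frozenset((v, parent[v])))
        v = parent[v]
    return path

def error_exponent(nodes, edges, rate):
    """Evaluate (33): minimize rate(e, e') over non-edges e' and e in Path(e'; E_P)."""
    edge_set = {frozenset(e) for e in edges}
    best = None
    for u, v in combinations(nodes, 2):
        e_prime = frozenset((u, v))
        if e_prime in edge_set:
            continue
        for e in tree_path_edges(edges, u, v):
            j = rate(e, e_prime)
            if best is None or j < best[0]:
                best = (j, e, e_prime)
    k_p, r_star, e_star = best
    dominant_error_tree = (edge_set | {e_star}) - {r_star}
    return k_p, r_star, e_star, dominant_error_tree

# Toy chain 1 - 2 - 3 - 4 with made-up mutual information values (assumption).
nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (3, 4)]
mi = {frozenset(pair): w for pair, w in {
    (1, 2): 0.50, (2, 3): 0.40, (3, 4): 0.45,
    (1, 3): 0.20, (2, 4): 0.30, (1, 4): 0.10}.items()}

def toy_rate(e, e_prime):
    # Stand-in for J_{e,e'}; NOT the true crossover rate.
    return (mi[e] - mi[e_prime]) ** 2

k_p, r_star, e_star, tree = error_exponent(nodes, edges, toy_rate)
print(k_p, sorted(r_star), sorted(e_star))
```

In this toy instance the non-edge (2, 4) replaces the edge (2, 3), since that pair has the smallest gap, which matches the intuition that the weakest edge on a path is the one most easily displaced.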

B. Conditions for Exponential Decay 

We now provide necessary and sufficient conditions that ensure that $K_P$ is strictly positive. This is of crucial importance since $K_P > 0$ implies exponential decay of the desired probability of error $\mathbb{P}(\mathcal{A}_n)$, where the error event $\mathcal{A}_n$ is defined in (5).

Theorem 6 (Condition for exponential decay): The following three statements are equivalent.

(a) The probability of error $\mathbb{P}(\mathcal{A}_n)$ decays exponentially, i.e.,

$$K_P > 0. \qquad (35)$$

(b) The mutual information quantities satisfy

$$I(P_{e'}) \neq I(P_e), \quad \forall\, e \in \mathrm{Path}(e'; \mathcal{E}_P),\ e' \notin \mathcal{E}_P. \qquad (36)$$

(c) $T_P$ is not a proper forest$^{10}$ as was assumed in Section II.

Proof: (Sketch) We first show that (a) $\Leftrightarrow$ (b).

($\Rightarrow$) We assume statement (a) is true, i.e., $K_P > 0$, and

10 A proper forest on $d$ nodes is an undirected, acyclic graph that has (strictly) fewer than $d - 1$ edges.

Fig. 3. Illustration for Example 1.

prove that statement (b) is true. Suppose, to the contrary, that $I(P_{e'}) = I(P_e)$ for some $e \in \mathrm{Path}(e'; \mathcal{E}_P)$ and some $e' \notin \mathcal{E}_P$. Then $J_{r(e'),e'} = 0$, where $r(e')$ is the replacement edge associated to $e'$. By (33), $K_P = 0$, which is a contradiction.

($\Leftarrow$) We now prove that statement (a) is true assuming statement (b) is true, i.e., $I(P_{e'}) \neq I(P_e)$ for all $e \in \mathrm{Path}(e'; \mathcal{E}_P)$ and $e' \notin \mathcal{E}_P$. By Theorem 2, the crossover rate $J_{r(e'),e'}$ in (19) is positive for all $e' \notin \mathcal{E}_P$. From (33), $K_P > 0$ since there are only finitely many $e'$; hence the minimum in (34) is attained at some non-zero value, i.e., $K_P = \min_{e' \notin \mathcal{E}_P} J_{r(e'),e'} > 0$.

Statement (c) is equivalent to statement (b). The proof of this claim makes use of the positivity condition that $P(\mathbf{x}) > 0$ for all $\mathbf{x} \in \mathcal{X}^d$ and the fact that if variables $x_1$, $x_2$ and $x_3$ form Markov chains $x_1 - x_2 - x_3$ and $x_1 - x_3 - x_2$, then $x_1$ is jointly independent of $x_2$ and $x_3$. Since this proof is rather lengthy, we refer the reader to Appendix C for the details. ■
Condition (b) states that, for each non-edge $e'$, we need $I(P_{e'})$ to be different from the mutual information of its dominant replacement edge $I(P_{r(e')})$. Condition (c) is a more intuitive condition for exponential decay of the probability of error $\mathbb{P}(\mathcal{A}_n)$. This is an important result since it says that for any non-degenerate tree distribution in which no pairwise joint distribution is a product distribution (i.e., not a proper forest), the error probability decays exponentially.

In the following example, we describe a simple random process for constructing a distribution $P$ such that all three conditions in Theorem 6 are satisfied with probability one (w.p. 1). See Fig. 3.

Example 1: Suppose the structure of $P$, a spanning tree distribution with graph $T_P = (\mathcal{V}, \mathcal{E}_P)$, is fixed and $\mathcal{X} = \{0, 1\}$. Now, we assign the parameters of $P$ using the following procedure. Let $x_1$ be the root node. Then randomly draw the parameter of the Bernoulli distribution $P_1(x_1)$ from a uniform distribution on $[0, 1]$, i.e., $P_1(x_1 = 0) = \theta_{x_1^0}$ and $\theta_{x_1^0} \sim \mathcal{U}[0, 1]$. Next let $\mathrm{nbd}(1)$ be the set of neighbors of $x_1$. Regard the set of variables $\{x_j : j \in \mathrm{nbd}(1)\}$ as children of $x_1$. For each $j \in \mathrm{nbd}(1)$, sample both $P(x_j = 0 \mid x_1 = 0) = \theta_{x_j^0 | x_1^0}$ as well as $P(x_j = 0 \mid x_1 = 1) = \theta_{x_j^0 | x_1^1}$ from independent uniform distributions on $[0, 1]$, i.e., $\theta_{x_j^0 | x_1^0} \sim \mathcal{U}[0, 1]$ and $\theta_{x_j^0 | x_1^1} \sim \mathcal{U}[0, 1]$. Repeat this procedure for all children of $x_1$. Then repeat the process for all other children. This construction results in a joint distribution $P(\mathbf{x}) > 0$ for all $\mathbf{x} \in \mathcal{X}^d$ w.p. 1. In this case, by continuity, all mutual informations are distinct w.p. 1, the graph is not a proper forest w.p. 1 and the rate $K_P > 0$ w.p. 1.
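The random construction in Example 1 can be simulated directly. The sketch below is an illustration under stated assumptions, not the authors' code: the function names, the zero-based node labels, the particular four-node tree and the fixed seed are all chosen here for convenience. It draws the root and conditional parameters from $\mathcal{U}[0,1]$, builds the joint pmf by enumeration, and computes all pairwise mutual informations so that positivity and distinctness can be inspected.

```python
import itertools
import math
import random

def sample_tree_distribution(d, edges, root, rng):
    """Draw all parameters i.i.d. from U[0, 1] and return the joint pmf as a dict."""
    adj = {i: [] for i in range(d)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    order, parent = [root], {root: None}
    for node in order:                       # breadth-first orientation away from the root
        for nb in adj[node]:
            if nb not in parent:
                parent[nb] = node
                order.append(nb)
    theta_root = rng.random()                # P(x_root = 0)
    # For each child j: (P(x_j = 0 | parent = 0), P(x_j = 0 | parent = 1)).
    theta = {j: (rng.random(), rng.random()) for j in order[1:]}
    joint = {}
    for x in itertools.product((0, 1), repeat=d):
        p = theta_root if x[root] == 0 else 1.0 - theta_root
        for j in order[1:]:
            p_zero = theta[j][x[parent[j]]]
            p *= p_zero if x[j] == 0 else 1.0 - p_zero
        joint[x] = p
    return joint

def mutual_information(joint, i, j):
    """Mutual information (in nats) between coordinates i and j of the joint pmf."""
    pij, pi, pj = {}, {}, {}
    for x, p in joint.items():
        pij[(x[i], x[j])] = pij.get((x[i], x[j]), 0.0) + p
        pi[x[i]] = pi.get(x[i], 0.0) + p
        pj[x[j]] = pj.get(x[j], 0.0) + p
    return sum(p * math.log(p / (pi[a] * pj[b])) for (a, b), p in pij.items() if p > 0)

rng = random.Random(0)                       # fixed seed for reproducibility (assumption)
edges = [(0, 1), (0, 2), (2, 3)]
joint = sample_tree_distribution(4, edges, root=0, rng=rng)
mi = {pair: mutual_information(joint, *pair)
      for pair in itertools.combinations(range(4), 2)}
for pair, value in sorted(mi.items()):
    print(pair, value)
```

For the non-edge (0, 3), whose path consists of the edges (0, 2) and (2, 3), the data-processing inequality guarantees that its mutual information does not exceed that of either edge on the path, which is the ordering that condition (b) of Theorem 6 requires to be strict.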




Fig. 4. A geometric interpretation of (19), where $P_{e,e'}$ is projected onto the submanifold of probability distributions $\{Q \in \mathcal{P}(\mathcal{X}^4) : I(Q_{e'}) = I(Q_e)\}$.



This example demonstrates that $\mathbb{P}(\mathcal{A}_n)$ decays exponentially for almost every tree distribution. More precisely, the set of tree distributions for which $\mathbb{P}(\mathcal{A}_n)$ does not decay exponentially has measure zero in $\mathcal{P}(\mathcal{X}^d)$.

C. Computational Complexity 

Finally, we provide an upper bound on the complexity to compute $K_P$ in (33). The complexity depends on the diameter of the tree $T_P = (\mathcal{V}, \mathcal{E}_P)$, defined as

$$\zeta(T_P) := \max_{u,v \in \mathcal{V}} L(u, v), \qquad (37)$$

where $L(u, v)$ is the length (number of hops) of the unique path between nodes $u$ and $v$. For example, $L(u, v) = 4$ for the non-edge $e' = (u, v)$ in the subtree in Fig. 2.

Theorem 7 (Computational Complexity for $K_P$): The number of computations of $J_{e,e'}$ to compute $K_P$, denoted $N(K_P)$, satisfies

$$N(K_P) \leq \frac{1}{2}\, \zeta(T_P)\, (d-1)(d-2). \qquad (38)$$

Proof: Given a non-neighbor node pair $e' \notin \mathcal{E}_P$, we perform a maximum of $\zeta(T_P)$ calculations to determine the dominant replacement edge $r(e')$ from (32). Combining this with the fact that there are a total of $\binom{d}{2} - |\mathcal{E}_P| = \binom{d}{2} - (d-1) = \frac{1}{2}(d-1)(d-2)$ node pairs not in $\mathcal{E}_P$, we obtain the upper bound. ■

Thus, if the diameter $\zeta(T_P)$ of the tree is relatively low and independent of the number of nodes $d$, the complexity is quadratic in $d$. For instance, for a star graph, the diameter $\zeta(T_P) = 2$. For a balanced tree,$^{11}$ $\zeta(T_P) = O(\log d)$, hence the number of computations is $O(d^2 \log d)$.
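The bound (38) can be compared against an exact count on small trees. The following sketch uses hypothetical helper names (`path_length`, `count_rate_evaluations`) and assumes one evaluation of $J_{e,e'}$ per edge on the path of each non-edge, as in the proof of Theorem 7. It reports the exact count, the diameter (37), and the bound for a star and for a chain on six nodes.

```python
from itertools import combinations

def path_length(edges, u, v):
    """Number of hops L(u, v) on the unique u-v path of a tree."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    dist = {u: 0}
    queue = [u]
    for node in queue:                # breadth-first search from u
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist[v]

def count_rate_evaluations(nodes, edges):
    """Return (N, zeta, bound): exact count of J evaluations, diameter, and bound (38)."""
    edge_set = {frozenset(e) for e in edges}
    lengths = {frozenset(p): path_length(edges, *p) for p in combinations(nodes, 2)}
    zeta = max(lengths.values())
    # Each non-edge e' needs one evaluation per edge on Path(e'; E_P).
    n_evals = sum(hops for pair, hops in lengths.items() if pair not in edge_set)
    d = len(nodes)
    bound = zeta * (d - 1) * (d - 2) // 2   # (d-1)(d-2) is always even
    return n_evals, zeta, bound

nodes = list(range(6))
star = [(0, i) for i in range(1, 6)]
chain = [(i, i + 1) for i in range(5)]
print("star :", count_rate_evaluations(nodes, star))
print("chain:", count_rate_evaluations(nodes, chain))
```

The star attains the bound with equality because every non-edge has a path of exactly two hops, whereas the chain falls below it since most non-edges have paths shorter than the diameter.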

VI. Euclidean Approximations 

In order to gain more insight into the error exponent, we make use of Euclidean approximations [13] of information-theoretic quantities to obtain an approximate but closed-form solution to (19), which is non-convex and hard to solve exactly. Our use of Euclidean approximations for various information-theoretic quantities is akin to various problems considered in other contexts in information theory [12], [13], [34].

We first approximate the crossover rate $J_{e,e'}$ for any two node pairs $e$ and $e'$, which do not share a common node.

11 A balanced tree is one where no leaf is much farther away from the root than any other leaf. The length of the longest direct path between any pair of nodes is $O(\log d)$.

Fig. 5. Convexifying the objective results in a least-squares problem. The objective is converted into a quadratic as in (44) and the linearized constraint set $\mathcal{Q}(P_{e,e'})$ is given in (45).



The joint distribution on $e$ and $e'$, namely $P_{e,e'}$, belongs to the set $\mathcal{P}(\mathcal{X}^4)$. Intuitively, the crossover rate $J_{e,e'}$ should depend on the "separation" of the mutual information values $I(P_e)$ and $I(P_{e'})$, and also on the uncertainty of the difference between the mutual information estimates $I(\widehat{P}_e)$ and $I(\widehat{P}_{e'})$. We will see that the approximate rate depends on these quantities through a simple expression, which can be regarded as the signal-to-noise ratio (SNR) for learning.

Roughly speaking, our strategy is to "convexify" the objective and the constraints in (19). See Figs. 4 and 5. To do so, we recall that if $P$ and $Q$ are two discrete distributions with the same support, and they are close entry-wise, the KL divergence can be approximated [13] as



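$$D(Q \,\|\, P) = -\sum_{x \in \mathcal{X}^d} Q(x) \log \frac{P(x)}{Q(x)} = -\sum_{x \in \mathcal{X}^d} Q(x) \log\Big[ 1 + \frac{P(x) - Q(x)}{Q(x)} \Big]$$

$$\approx \frac{1}{2} \sum_{x \in \mathcal{X}^d} \frac{(Q(x) - P(x))^2}{Q(x)} \qquad (39)$$

$$= \frac{1}{2} \|Q - P\|_Q^2, \qquad (40)$$

where $\|y\|_w^2$ denotes the weighted squared norm of $y$, i.e., $\|y\|_w^2 := \sum_i y_i^2 / w_i$. The approximation in (39) follows from the second-order expansion $\log(1 + z) = z - z^2/2 + O(z^3)$ as $z \to 0$, where the first-order terms cancel because $\sum_x (P(x) - Q(x)) = 0$. It becomes tight as $\epsilon = \|P - Q\|_\infty \to 0$. Moreover, it remains tight even if the subscript $Q$ in (40) is changed to a distribution $Q'$ in the vicinity of $Q$ [13]. That is, the difference between $\|Q - P\|_Q$ and $\|Q - P\|_{Q'}$ is negligible compared to either term when $Q' \approx Q$. Using this fact and the assumption that $P$ and $Q$ are two discrete distributions that are close entry-wise,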



$$D(Q \,\|\, P) \approx \frac{1}{2} \|Q - P\|_P^2. \qquad (41)$$



In fact, it is also known [13] that if $\|P - Q\|_\infty < \epsilon$ for some small $\epsilon > 0$, we also have $D(P \,\|\, Q) \approx D(Q \,\|\, P)$.
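The quality of the quadratic approximations of the KL divergence can be inspected numerically. The sketch below is illustrative: the base distribution `p`, the perturbation `direction` and the helper names are assumptions made here, not quantities from the text. It perturbs `p` by a zero-sum direction of size `eps` and reports the relative error of $\frac{1}{2}\|Q - P\|_Q^2$ and $\frac{1}{2}\|Q - P\|_P^2$ with respect to the exact $D(Q \,\|\, P)$.

```python
import math

def kl(q, p):
    """KL divergence D(q || p) in nats for pmfs given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def half_weighted_sq_norm(y, w):
    """(1/2) * ||y||_w^2 with ||y||_w^2 = sum_i y_i^2 / w_i."""
    return 0.5 * sum(yi * yi / wi for yi, wi in zip(y, w))

p = [0.4, 0.3, 0.2, 0.1]
direction = [0.5, -0.5, 0.25, -0.25]   # sums to zero, so q remains normalized

def relative_errors(eps):
    """Relative error of the Q-weighted and P-weighted approximations."""
    q = [pi + eps * di for pi, di in zip(p, direction)]
    diff = [qi - pi for qi, pi in zip(q, p)]
    exact = kl(q, p)
    err_q = abs(half_weighted_sq_norm(diff, q) - exact) / exact
    err_p = abs(half_weighted_sq_norm(diff, p) - exact) / exact
    return err_q, err_p

for eps in (1e-1, 1e-2, 1e-3):
    print(eps, relative_errors(eps))
```

Both relative errors shrink roughly in proportion to `eps`, which is consistent with the approximation error being of third order while the divergence itself is of second order.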

In the following, to make our statements precise, we will use the notation $\alpha_1 \approx_\delta \alpha_2$ to denote that two real numbers $\alpha_1$ and $\alpha_2$ are in the $\delta$ neighborhood of each other, i.e., $|\alpha_1 - \alpha_2| < \delta$.$^{12}$ We will also need the following notion of information density to state our approximation for $J_{e,e'}$.

12 In the following, we will also have continuity statements where, given $\epsilon > 0$, $\alpha_1 \approx_\epsilon \alpha_2$ implies that there exists some $\delta = \delta(\epsilon) > 0$ such that $\beta_1 \approx_\delta \beta_2$. We will be casual about specifying what the $\delta$'s are.



Definition 3 (Information Density): Given a pairwise joint distribution $P_{i,j}$ on $\mathcal{X}^2$ with marginals $P_i$ and $P_j$, the information density [35], [36] function, denoted by $s_{i,j} : \mathcal{X}^2 \to \mathbb{R}$, is defined as

$$s_{i,j}(x_i, x_j) := \log \frac{P_{i,j}(x_i, x_j)}{P_i(x_i) P_j(x_j)}, \quad \forall\, (x_i, x_j) \in \mathcal{X}^2. \qquad (42)$$

Hence, for each node pair $e = (i, j)$, the information density $s_e$ can also be regarded as a random variable whose expectation is simply the mutual information between $x_i$ and $x_j$, i.e., $\mathbb{E}[s_e] = I(P_e)$.

Recall that we also assumed in Section II that $T_P$ is a spanning tree, which implies that for all node pairs $(i, j)$, $P_{i,j}$ is not a product distribution, i.e., $P_{i,j} \neq P_i P_j$, because if it were, then $T_P$ would be disconnected. We now define a condition for which our approximation holds.

Definition 4 ($\epsilon$-Very Noisy Condition): We say that $P_{e,e'} \in \mathcal{P}(\mathcal{X}^4)$, the joint distribution on node pairs $e$ and $e'$, satisfies the $\epsilon$-very noisy condition if

$$\|P_e - P_{e'}\|_\infty := \max_{(x_i, x_j) \in \mathcal{X}^2} |P_e(x_i, x_j) - P_{e'}(x_i, x_j)| < \epsilon. \qquad (43)$$

This condition is needed because if (43) holds, then by continuity of the mutual information, there exists a $\delta > 0$ such that $I(P_e) \approx_\delta I(P_{e'})$, which means that the mutual information quantities are difficult to distinguish and the approximation in (40) is accurate.$^{13}$ Note that proximity of the mutual information values is not sufficient for the approximation to hold since we have seen from Theorem 2 that $J_{e,e'}$ depends not only on the mutual information quantities but on the entire joint distribution $P_{e,e'}$.

We now define the approximate crossover rate on disjoint node pairs $e$ and $e'$ as

$$\widetilde{J}_{e,e'} := \inf \Big\{ \frac{1}{2} \|Q - P_{e,e'}\|_{P_{e,e'}}^2 : Q \in \mathcal{Q}(P_{e,e'}) \Big\}, \qquad (44)$$



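where the (linearized) constraint set is

$$\mathcal{Q}(P_{e,e'}) := \Big\{ Q \in \mathcal{P}(\mathcal{X}^4) : I(P_e) + \langle \nabla_{P_e} I(P_e),\, Q - P_{e,e'} \rangle = I(P_{e'}) + \langle \nabla_{P_{e'}} I(P_{e'}),\, Q - P_{e,e'} \rangle \Big\}, \qquad (45)$$

where $\nabla_{P_e} I(P_e)$ is the gradient vector of the mutual information with respect to the joint distribution $P_e$. We also define the approximate error exponent as

$$\widetilde{K}_P := \min_{e' \notin \mathcal{E}_P} \; \min_{e \in \mathrm{Path}(e'; \mathcal{E}_P)} \widetilde{J}_{e,e'}. \qquad (46)$$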

e'<£S r eePath(e';£ P ) 

We now provide the expression for the approximate crossover rate $\widetilde{J}_{e,e'}$ and also state the conditions under which the approximation is asymptotically accurate.$^{14}$

Theorem 8 (Euclidean approximation of $J_{e,e'}$): The approximate crossover rate for the empirical mutual information quantities, defined in (44), is given by

$$\widetilde{J}_{e,e'} = \frac{(\mathbb{E}[s_{e'} - s_e])^2}{2\,\mathrm{Var}(s_{e'} - s_e)} = \frac{(I(P_{e'}) - I(P_e))^2}{2\,\mathrm{Var}(s_{e'} - s_e)}, \qquad (47)$$



13 Here and in the following, we do not specify the exact value of $\delta$ but we simply note that as $\epsilon \to 0$, the approximation in (41) becomes tighter.

14 We say that a collection of approximations $\{\widetilde{\theta}(\epsilon) : \epsilon > 0\}$ of a true parameter $\theta$ is asymptotically accurate if the approximations converge to $\theta$ as $\epsilon \to 0$.

where $s_e$ is the information density defined in (42) and the expectation and variance are both with respect to $P_{e,e'}$. Furthermore, the approximation (47) is asymptotically accurate, i.e., as $\epsilon \to 0$ (in the definition of the $\epsilon$-very noisy condition), we have that $\widetilde{J}_{e,e'} \to J_{e,e'}$.

Proof: (Sketch) Eqs. (44) and (45) together define a least squares problem. Upon simplification of the solution, we obtain (47). See Appendix D for the details. ■
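The SNR expression (47) only requires first and second moments of the information densities, so it can be evaluated by direct enumeration. The sketch below is an illustration under assumptions made here: the binary Markov chain $x_0 - x_1 - x_2 - x_3$, its "stay" probabilities, the zero-based node labels and all function names are hypothetical choices, with $e = (1, 2)$ an edge and $e' = (0, 3)$ a non-edge that shares no node with $e$.

```python
import itertools
import math

def chain_joint(stay):
    """Binary Markov chain x0 - x1 - x2 - x3 with P(x0 = 0) = 1/2 and
    stay[k] = P(x_{k+1} = x_k). Parameters are illustrative only."""
    joint = {}
    for x in itertools.product((0, 1), repeat=4):
        p = 0.5
        for k in range(3):
            p *= stay[k] if x[k + 1] == x[k] else 1.0 - stay[k]
        joint[x] = p
    return joint

def pair_marginal(joint, pair):
    i, j = pair
    out = {}
    for x, p in joint.items():
        out[(x[i], x[j])] = out.get((x[i], x[j]), 0.0) + p
    return out

def information_density(pair_pmf):
    """s(a, b) = log P(a, b) / (P(a) P(b)), as in (42)."""
    pa, pb = {}, {}
    for (a, b), p in pair_pmf.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return {(a, b): math.log(p / (pa[a] * pb[b])) for (a, b), p in pair_pmf.items()}

def mutual_information(joint, pair):
    pmf = pair_marginal(joint, pair)
    s = information_density(pmf)
    return sum(p * s[key] for key, p in pmf.items())

def snr_rate(joint, e, e_prime):
    """Return ((E[s_e' - s_e])^2 / (2 Var(s_e' - s_e)), E[s_e' - s_e]) under the joint pmf."""
    s_e = information_density(pair_marginal(joint, e))
    s_ep = information_density(pair_marginal(joint, e_prime))
    diff = {x: s_ep[(x[e_prime[0]], x[e_prime[1]])] - s_e[(x[e[0]], x[e[1]])]
            for x in joint}
    mean = sum(p * diff[x] for x, p in joint.items())
    var = sum(p * (diff[x] - mean) ** 2 for x, p in joint.items())
    return mean ** 2 / (2.0 * var), mean

joint = chain_joint([0.6, 0.55, 0.6])
e, e_prime = (1, 2), (0, 3)
rate, mean_gap = snr_rate(joint, e, e_prime)
print("approximate rate:", rate)
print("I(P_e') - I(P_e):", mean_gap)
```

The mean of $s_{e'} - s_e$ coincides with the difference of the two mutual informations, which is the second equality in (47); the weak dependencies chosen here keep $P_e$ and $P_{e'}$ close entry-wise, in the spirit of the $\epsilon$-very noisy condition.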
We also have an additional result for the Euclidean approximation for the overall error exponent $K_P$.

Corollary 9 (Euclidean approximation of $K_P$): The approximate error exponent $\widetilde{K}_P$ is asymptotically accurate when either one of the following conditions is true.

(a) The joint distribution $P_{r(e'),e'}$ satisfies the $\epsilon$-very noisy condition for every $e' \notin \mathcal{E}_P$.

(b) The joint distribution $P_{r(e^*),e^*}$, where $e^*$ is defined in (34) and $r(e^*)$ is the dominant replacement edge associated to non-edge $e^*$, satisfies the $\epsilon$-very noisy condition but all the other distributions on the non-neighbor node pairs $e' \notin \mathcal{E}_P \cup \{e^*\}$ along with their dominant replacement edges $r(e')$ do not satisfy the $\epsilon$-very noisy condition.

Proof: Condition (a) follows directly from the continuity of the min function in (46). If condition (b) holds, then $e^* \notin \mathcal{E}_P$ dominates all the crossover events in (46) because the rate $J_{r(e^*),e^*}$ is the minimum. Consequently, the dominant approximate crossover rate $\widetilde{J}_{r(e^*),e^*}$ is asymptotically accurate, which implies that $\widetilde{K}_P$ is also asymptotically accurate. ■
Hence, the expressions for the crossover rate $J_{e,e'}$ and the error exponent $K_P$ are vastly simplified under the $\epsilon$-very noisy condition on the joint distributions $P_{e,e'}$. The approximate crossover rate $\widetilde{J}_{e,e'}$ in (47) has a very intuitive meaning. It is proportional to the square of the difference between the mutual information quantities of $P_e$ and $P_{e'}$. This corresponds exactly to our initial intuition: if $I(P_e)$ and $I(P_{e'})$ are well separated ($I(P_e) \gg I(P_{e'})$), then the crossover rate has to be large. $\widetilde{J}_{e,e'}$ is also weighted by the precision (inverse variance) of $(s_{e'} - s_e)$. If this variance is large, then we are uncertain about the estimate $I(\widehat{P}_e) - I(\widehat{P}_{e'})$, and crossovers are more likely, thereby reducing the crossover rate $\widetilde{J}_{e,e'}$.

We now comment on our assumption of $P_{e,e'}$ satisfying the $\epsilon$-very noisy condition, under which the approximation is tight as seen in Theorem 8. When $P_{e,e'}$ is $\epsilon$-very noisy, then we have $I(P_e) \approx_\delta I(P_{e'})$, which implies that the optimal solution of (19) satisfies $Q^*_{e,e'} \approx_{\delta'} P_{e,e'}$. When $e$ is an edge and $e'$ is a non-neighbor node pair, this implies that it is very hard to distinguish the relative magnitudes of the empiricals $I(\widehat{P}_e)$ and $I(\widehat{P}_{e'})$. Hence, the particular problem of learning the distribution $P_{e,e'}$ from samples is very noisy. Under these conditions, the approximation in (47) is accurate.

In summary, our approximation in (47) takes into account not only the absolute difference between the mutual information quantities $I(P_e)$ and $I(P_{e'})$, but also the uncertainty in learning them. The expression in (47) is, in fact, the SNR for the estimation of the difference between empirical mutual information quantities. This answers one of the fundamental questions we posed in the introduction, viz., that we are now able to distinguish between distributions that are "easy" to learn and those that are "difficult" by computing the set of SNR quantities $\{\widetilde{J}_{e,e'}\}$ in (46).

Fig. 6. Reverse I-projection [29] of $P$ onto the set of tree distributions $\mathcal{D}(\mathcal{X}^d, \mathcal{T}^d)$ given by (48).

VII. Extensions to Non-Tree Distributions 

In all the preceding sections, we dealt exclusively with the case where the true distribution $P$ is Markov on a tree. In this section, we extend the preceding large-deviation analysis to deal with distributions $P$ that may not be tree-structured but in which we estimate a tree distribution from the given set of samples $\mathbf{x}^n$, using the Chow-Liu ML-estimation procedure. Since the Chow-Liu procedure outputs a tree, it is not possible to learn the structure of $P$ correctly. Hence, it will be necessary to redefine the error event.

When $P$ is not a tree distribution, we analyze the properties of the optimal reverse I-projection [29] of $P$ onto the set of tree distributions, given by the optimization problem$^{15}$

$$\Pi^*(P) := \min_{Q \in \mathcal{D}(\mathcal{X}^d, \mathcal{T}^d)} D(P \,\|\, Q). \qquad (48)$$

$\Pi^*(P)$ is the KL-divergence of $P$ to the closest element in $\mathcal{D}(\mathcal{X}^d, \mathcal{T}^d)$. See Fig. 6. As Chow and Wagner [9] noted, if $P$ is not a tree, there may be several trees optimizing (48).$^{16}$ We denote the set of optimal projections as $\mathcal{P}^*(P)$, given by

$$\mathcal{P}^*(P) := \{ Q \in \mathcal{D}(\mathcal{X}^d, \mathcal{T}^d) : D(P \,\|\, Q) = \Pi^*(P) \}. \qquad (49)$$

We now illustrate that P*(P) may have more than one 
element with the following example. 

15 The minimum in the optimization problem in (48) is attained because the KL-divergence is continuous and the set of tree distributions $\mathcal{D}(\mathcal{X}^d, \mathcal{T}^d)$ is compact.

16 This is a technical condition of theoretical interest in this section. In fact, it can be shown that the set of distributions such that there is more than one tree optimizing (48) has measure zero in $\mathcal{P}(\mathcal{X}^d)$.

Fig. 7. Each tree defines an e-flat submanifold [37], [38] of probability distributions. These are the two lines as shown in the figure. If the KL-divergences $D(P \,\|\, P_{est}^{(1)})$ and $D(P \,\|\, P_{est}^{(2)})$ are equal, then $P_{est}^{(1)}$ and $P_{est}^{(2)}$ do not have the same structure but both are optimal with respect to the optimization problem in (48). An example of such a distribution $P$ is provided in Example 2.



Example 2: Consider the parameterized discrete probability distribution $P \in \mathcal{P}(\{0, 1\}^3)$ shown in Table I, where $\xi \in (0, 1/3)$ and $\kappa \in (0, 1/2)$ are constants.

Proposition 10 (Non-uniqueness of projection): For sufficiently small $\kappa$, the Chow-Liu MWST algorithm (using either Kruskal's [31] or Prim's [32] procedure) will first include the edge $(1, 2)$. Then, it will arbitrarily choose between the two remaining edges $(2, 3)$ or $(1, 3)$.

Thus, the optimal tree structure $P^*$ is not unique. See Fig. 7 for an information geometric interpretation.

Every tree distribution in $\mathcal{P}^*(P)$ has the maximum sum mutual information weight. More precisely, we have

$$\sum_{e \in \mathcal{E}_Q} I(Q_e) = \max_{Q' \in \mathcal{D}(\mathcal{X}^d, \mathcal{T}^d)} \sum_{e \in \mathcal{E}_{Q'}} I(P_e), \quad \forall\, Q \in \mathcal{P}^*(P). \qquad (50)$$

Given (50), we note that when we use a MWST algorithm to find the optimal solution to the problem in (48), ties will be encountered during the greedy addition of edges, as demonstrated in Example 2. Upon breaking the ties arbitrarily, we obtain some distribution $Q \in \mathcal{P}^*(P)$. We now provide a sequence of useful definitions that lead to the definition of a new error event for which we can perform large-deviation analysis.

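We denote the set of tree structures$^{17}$ corresponding to the distributions in $\mathcal{P}^*(P)$ as

$$\mathcal{T}_{\mathcal{P}^*(P)} := \{ T_Q \in \mathcal{T}^d : Q \in \mathcal{P}^*(P) \}, \qquad (51)$$

and term it as the set of optimal tree projections. A similar definition applies to the edge sets of optimal tree projections

$$\mathcal{E}_{\mathcal{P}^*(P)} := \{ \mathcal{E}_Q : T_Q = (\mathcal{V}, \mathcal{E}_Q) \in \mathcal{T}^d,\ Q \in \mathcal{P}^*(P) \}. \qquad (52)$$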

Since the distribution $P$ is unknown, our goal is to estimate the optimal tree-projection $P_{est}$ using the empirical distribution $\widehat{P}$, where $P_{est}$ is given by

$$P_{est} := \operatorname*{argmin}_{Q \in \mathcal{D}(\mathcal{X}^d, \mathcal{T}^d)} D(\widehat{P} \,\|\, Q). \qquad (53)$$

If there are multiple minimizers $Q$, we arbitrarily pick one of them. We will see that by redefining the error event, we will still have an LDP. Finding the reverse I-projection $P_{est}$ can be solved efficiently (in time $O(d^2 \log d)$) using the Chow-Liu algorithm [2] as described in Section III.

17 In fact, each tree defines a so-called e-flat submanifold [37], [38] in the set of probability distributions on $\mathcal{X}^d$ and $P_{est}$ lies in both submanifolds. The so-called m-geodesic connects $P$ to any of its optimal projections $P_{est} \in \mathcal{P}^*(P)$.



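We define $T_{P_{est}} = (\mathcal{V}, \mathcal{E}_{P_{est}})$ as the graph of $P_{est}$, which is the learned tree, and redefine the new error event as

$$\mathcal{A}_n(\mathcal{P}^*(P)) := \{ \mathcal{E}_{P_{est}} \notin \mathcal{E}_{\mathcal{P}^*(P)} \}. \qquad (54)$$

Note that this new error event essentially reduces to the original error event $\mathcal{A}_n = \mathcal{A}_n(\{P\})$ in (5) if $\mathcal{T}_{\mathcal{P}^*(P)}$ contains only one member. So if the learned structure belongs to $\mathcal{E}_{\mathcal{P}^*(P)}$, there is no error; otherwise an error is declared. We would like to analyze the decay of the error probability of $\mathcal{A}_n(\mathcal{P}^*(P))$ as defined in (54), i.e., find the new error exponent

$$K_{\mathcal{P}^*(P)} := \lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}(\mathcal{A}_n(\mathcal{P}^*(P))). \qquad (55)$$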



It turns out that the analysis of the new event $\mathcal{A}_n(\mathcal{P}^*(P))$ is very similar to the analysis performed in Section V. We redefine the notion of a dominant replacement edge and the computation of the new rate $K_{\mathcal{P}^*(P)}$ then follows automatically.

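Definition 5 (Dominant Replacement Edge): Fix an edge set $\mathcal{E}_Q \in \mathcal{E}_{\mathcal{P}^*(P)}$. For the error event $\mathcal{A}_n(\mathcal{P}^*(P))$ defined in (54), given a non-neighbor node pair $e' \notin \mathcal{E}_Q$, its dominant replacement edge $r(e'; \mathcal{E}_Q)$ with respect to $\mathcal{E}_Q$ is given by

$$r(e'; \mathcal{E}_Q) := \operatorname*{argmin}_{\substack{e \in \mathrm{Path}(e'; \mathcal{E}_Q) \\ \mathcal{E}_Q \cup \{e'\} \setminus \{e\} \notin \mathcal{E}_{\mathcal{P}^*(P)}}} J_{e,e'}, \qquad (56)$$

if there exists an edge $e \in \mathrm{Path}(e'; \mathcal{E}_Q)$ such that $\mathcal{E}_Q \cup \{e'\} \setminus \{e\} \notin \mathcal{E}_{\mathcal{P}^*(P)}$. Otherwise $r(e'; \mathcal{E}_Q) = \emptyset$. $J_{e,e'}$ is the crossover rate of mutual information quantities defined in (18). If $r(e'; \mathcal{E}_Q)$ exists, the corresponding crossover rate is

$$J_{r(e'; \mathcal{E}_Q),e'} = \min_{\substack{e \in \mathrm{Path}(e'; \mathcal{E}_Q) \\ \mathcal{E}_Q \cup \{e'\} \setminus \{e\} \notin \mathcal{E}_{\mathcal{P}^*(P)}}} J_{e,e'}, \qquad (57)$$

otherwise $J_{\emptyset,e'} = +\infty$.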

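In (56), we are basically fixing an edge set $\mathcal{E}_Q \in \mathcal{E}_{\mathcal{P}^*(P)}$ and excluding the trees with $e \in \mathrm{Path}(e'; \mathcal{E}_Q)$ replaced by $e'$ if it belongs to the set of optimal tree projections $\mathcal{T}_{\mathcal{P}^*(P)}$. We further remark that in (56), $r(e')$ may not necessarily exist. Indeed, this occurs if every tree with $e \in \mathrm{Path}(e'; \mathcal{E}_Q)$ replaced by $e'$ belongs to the set of optimal tree projections. This is, however, not an error by the definition of the error event in (54); hence, we set $J_{\emptyset,e'} = +\infty$. In addition, we define the dominant non-edge associated to edge set $\mathcal{E}_Q \in \mathcal{E}_{\mathcal{P}^*(P)}$ as:

$$e^*(\mathcal{E}_Q) := \operatorname*{argmin}_{e' \notin \mathcal{E}_Q} \; \min_{\substack{e \in \mathrm{Path}(e'; \mathcal{E}_Q) \\ \mathcal{E}_Q \cup \{e'\} \setminus \{e\} \notin \mathcal{E}_{\mathcal{P}^*(P)}}} J_{e,e'}. \qquad (58)$$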


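Also, the dominant structure in the set of optimal tree projections is defined as

$$\mathcal{E}_{P^*} := \operatorname*{argmin}_{\mathcal{E}_Q \in \mathcal{E}_{\mathcal{P}^*(P)}} J_{r(e^*(\mathcal{E}_Q); \mathcal{E}_Q),\, e^*(\mathcal{E}_Q)}, \qquad (59)$$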



where the crossover rate $J_{r(e'; \mathcal{E}_Q),e'}$ is defined in (57) and the dominant non-edge $e^*(\mathcal{E}_Q)$ associated to $\mathcal{E}_Q$ is defined in (58). Equipped with these definitions, we are now ready to state the generalization of Theorem 5.

Theorem 11 (Dominant Error Tree): For the error event $\mathcal{A}_n(\mathcal{P}^*(P))$ defined in (54), a dominant error tree (which may not be unique) has edge set given by

$$\mathcal{E}_{P^*} \cup \{e^*(\mathcal{E}_{P^*})\} \setminus \{r(e^*(\mathcal{E}_{P^*}); \mathcal{E}_{P^*})\}, \qquad (60)$$

where $e^*(\mathcal{E}_{P^*})$ is the dominant non-edge associated to the dominant structure $\mathcal{E}_{P^*} \in \mathcal{E}_{\mathcal{P}^*(P)}$ and is defined by (58) and (59). Furthermore, the error exponent $K_{\mathcal{P}^*(P)}$, defined in (55), is given as

$$K_{\mathcal{P}^*(P)} = \min_{\mathcal{E}_Q \in \mathcal{E}_{\mathcal{P}^*(P)}} \; \min_{e' \notin \mathcal{E}_Q} \; \min_{\substack{e \in \mathrm{Path}(e'; \mathcal{E}_Q) \\ \mathcal{E}_Q \cup \{e'\} \setminus \{e\} \notin \mathcal{E}_{\mathcal{P}^*(P)}}} J_{e,e'}. \qquad (61)$$

Proof: The proof of this theorem follows directly by identifying the dominant error tree belonging to the set $\mathcal{T}^d \setminus \mathcal{T}_{\mathcal{P}^*(P)}$. By further applying the result in Proposition 4 and Theorem 5, we obtain the result via the "worst-exponent-wins" [10, Ch. 1] principle by minimizing over all trees in the set of optimal projections $\mathcal{E}_{\mathcal{P}^*(P)}$ in (61). ■

This theorem now allows us to analyze the more general error event $\mathcal{A}_n(\mathcal{P}^*(P))$, which includes $\mathcal{A}_n$ in (5) as a special case if the set of optimal tree projections $\mathcal{T}_{\mathcal{P}^*(P)}$ in (51) is a singleton.

VIII. Numerical Experiments 

In this section, we perform a series of numerical experiments with the following three objectives:

1) In Section VIII-A, we study the accuracy of the Euclidean approximations (Theorem 8). We do this by analyzing under which regimes the approximate crossover rate $\widetilde{J}_{e,e'}$ in (47) is close to the true crossover rate $J_{e,e'}$ in (19).

2) Since the LDP and error exponent analysis are asymptotic theories, in Section VIII-B we use simulations to study the behavior of the actual crossover rate, given a finite number of samples $n$. In particular, we study how fast the crossover rate, obtained from simulations, converges to the true crossover rate. To do so, we generate a number of samples from the true distribution and use the Chow-Liu algorithm to learn tree structures. Then we compare the result to the true structure and finally compute the error probability.

3) In Section VIII-C, we address the issue of the learner not having access to the true distribution, but nonetheless wanting to compute an estimate of the crossover rate. The learner only has the samples $\mathbf{x}^n$ or, equivalently, the empirical distribution $\widehat{P}$. However, in all the preceding analysis, to compute the true crossover rate $J_{e,e'}$ and the overall error exponent $K_P$, we used the true distribution $P$ and solved the constrained optimization problem in (19). Alternatively, we computed the approximation in (47), which is also a function of the true distribution. However, in practice, it is also useful to compute an online estimate of the crossover rate by using the empirical distribution in place of the true distribution in the constrained optimization problem in (19). This is an estimate of the rate that the learner can compute given the samples. We call this the empirical rate and formally



Fig. 8. Graphical model used for our numerical experiments. The true model is a symmetric star (cf. Section IV-A) in which the mutual information quantities satisfy $I(P_{1,2}) = I(P_{1,3}) = I(P_{1,4})$ and by construction, $I(P_{e'}) < I(P_{1,2})$ for any non-edge $e'$. Besides, the mutual information quantities on the non-edges are equal, for example, $I(P_{2,3}) = I(P_{3,4})$.



define it in Section VIII-C. We perform convergence analysis of the empirical rate and also numerically verify the rate of convergence to the true crossover rate.

In the following, we will be performing numerical experiments 
for the undirected graphical model with four nodes as shown in 
Fig. [8] We parameterize the distribution with d = 4 variables 
with a single parameter 7 > and let X = {0, 1}, i.e., all the 
variables are binary. For the parameters, we set P%(xi = 0) = 
1/3 and 



P i{1 (x i =0\x 1 =0) = -+ 1 , % = 2,3,4, 



P ill {x i = 0\x 1 = 1) 



7, 



2,3,4. 



(62) 



With this parameterization, we see that if $\gamma$ is small, the
mutual information $I(P_{1,i})$ for $i = 2,3,4$ is also small. In
fact, if $\gamma = 0$, $x_1$ is independent of $x_i$ for $i = 2,3,4$ and, as
a result, $I(P_{1,i}) = 0$. Conversely, if $\gamma$ is large, the mutual
information $I(P_{1,i})$ increases as the dependence of the outer
nodes with the central node increases. Thus, we can vary the
size of the mutual information along the edges by varying $\gamma$.
By symmetry, there is only one crossover rate and hence this
crossover rate is also the error exponent for the error event $\mathcal{A}_n$.
This is exactly the same as the symmetric star graph
described in Section IV-A.
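As a rough illustration of how the single parameter controls the edge weights, the following sketch tabulates the mutual information on an edge and on a non-edge of the star. It assumes the conditionals take the form $P(x_i = x_1 \mid x_1) = 1/2 + \gamma$ with $P(x_1 = 0) = 1/3$; the function names `star_joint` and `pair_mi` are hypothetical helpers, not part of the original experiments (which were run in Matlab).

```python
import math
from itertools import product

def star_joint(gamma):
    """Joint pmf of (x1, x2, x3, x4) for the assumed binary star with hub x1."""
    p1 = {0: 1.0 / 3.0, 1: 2.0 / 3.0}

    def cond(xi, x1):
        # Assumed conditionals: P(xi = x1 | x1) = 1/2 + gamma.
        return 0.5 + gamma if xi == x1 else 0.5 - gamma

    return {x: p1[x[0]] * cond(x[1], x[0]) * cond(x[2], x[0]) * cond(x[3], x[0])
            for x in product((0, 1), repeat=4)}

def pair_mi(joint, i, j):
    """Mutual information (in nats) between coordinates i and j of a joint pmf."""
    pij, pi, pj = {}, {}, {}
    for x, p in joint.items():
        pij[(x[i], x[j])] = pij.get((x[i], x[j]), 0.0) + p
        pi[x[i]] = pi.get(x[i], 0.0) + p
        pj[x[j]] = pj.get(x[j], 0.0) + p
    return sum(p * math.log(p / (pi[a] * pj[b]))
               for (a, b), p in pij.items() if p > 0)

# Edge (hub, leaf) versus non-edge (leaf, leaf) for a few values of gamma.
for gamma in (0.0, 0.05, 0.1, 0.2):
    P = star_joint(gamma)
    print(gamma, pair_mi(P, 0, 1), pair_mi(P, 1, 2))
```

Under these assumptions both quantities vanish at $\gamma = 0$, and the edge value stays above the non-edge value for $\gamma > 0$, as the data processing inequality requires.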

A. Accuracy of Euclidean Approximations 

We first study the accuracy of the Euclidean approximations
used to derive the result in Theorem 8. We denote the true
rate as the crossover rate resulting from the non-convex
optimization problem (19) and the approximate rate as the
crossover rate computed using the approximation in (47).

We vary $\gamma$ from 0 to 0.2 and plot both the true and
approximate rates against the difference between the mutual
informations $I(P_e) - I(P_{e'})$ in Fig. 9, where $e$ denotes any
edge and $e'$ denotes any non-edge in the model. The non-convex
optimization problem was solved using the Matlab
function fmincon in the optimization toolbox. We used several
different feasible starting points and kept the smallest
objective value to reduce the risk of stopping at a local minimum. We first
note from Fig. 9 that both rates increase as $I(P_e) - I(P_{e'})$
increases. This is in line with our intuition because if $P_{e,e'}$
is such that $I(P_e) - I(P_{e'})$ is large, the crossover rate is
Fig. 9. Comparison of True and Approximate Rates.



also large. We also observe that if $I(P_e) - I(P_{e'})$ is small,
the true and approximate rates are very close. This is in
line with the assumptions for Theorem 8. Recall that if $P_{e,e'}$
satisfies the $\epsilon$-very noisy condition (for some small $\epsilon$), then the
mutual information quantities $I(P_e)$ and $I(P_{e'})$ are close and
consequently the true and approximate crossover rates are also
close. When the difference between the mutual informations
increases, the true and approximate rates separate from each
other.

B. Comparison of True Crossover Rate to the Rate obtained 
from Simulations 

In this section, we compare the true crossover rate in (19) to
the rate we obtain when we learn tree structures using Chow-Liu
with i.i.d. samples drawn from $P$, which we define as
the simulated rate. We fixed $\gamma > 0$ in (62); then, for each
$n$, we estimated the probability of error using the Chow-Liu
algorithm as described in Section III. We state the procedure
precisely in the following steps.

1) Fix $n \in \mathbb{N}$ and sample $n$ i.i.d. observations $x^n$ from $P$.

2) Compute the empirical distribution $\widehat{P}$ and the set of empirical
mutual information quantities $\{I(\widehat{P}_e) : e \in \binom{V}{2}\}$.

3) Learn the Chow-Liu tree $\mathcal{E}_{\mathrm{ML}}$ using a MWST algorithm
with $\{I(\widehat{P}_e) : e \in \binom{V}{2}\}$ as the edge weights.

4) If $\mathcal{E}_{\mathrm{ML}}$ is not equal to $\mathcal{E}_P$, then we declare an error.

5) Repeat steps 1 - 4 a total of $M \in \mathbb{N}$ times and
estimate the probability of error $\mathbb{P}(\mathcal{A}_n) = \#\text{errors}/M$
and the error exponent $-(1/n) \log \mathbb{P}(\mathcal{A}_n)$, which is the
simulated rate.
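The five steps above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the code used for the reported experiments: it assumes the four-node binary star with $P(x_1 = 0) = 1/3$ and $P(x_i = x_1 \mid x_1) = 1/2 + \gamma$, uses Kruskal's algorithm as the MWST routine, and all function names (`sample_star`, `empirical_mi`, `chow_liu_edges`, `simulated_rate`) are hypothetical.

```python
import math
import random
from itertools import combinations

def sample_star(n, gamma, rng):
    """Step 1: draw n i.i.d. samples (x1, x2, x3, x4) from the assumed star."""
    data = []
    for _ in range(n):
        x1 = 0 if rng.random() < 1.0 / 3.0 else 1
        # Assumed conditionals: each leaf copies the hub w.p. 1/2 + gamma.
        leaves = [x1 if rng.random() < 0.5 + gamma else 1 - x1 for _ in range(3)]
        data.append((x1, *leaves))
    return data

def empirical_mi(data, i, j):
    """Step 2: mutual information (nats) of the empirical pairwise distribution."""
    n = len(data)
    cij, ci, cj = {}, {}, {}
    for x in data:
        cij[(x[i], x[j])] = cij.get((x[i], x[j]), 0) + 1
        ci[x[i]] = ci.get(x[i], 0) + 1
        cj[x[j]] = cj.get(x[j], 0) + 1
    return sum(c / n * math.log(c * n / (ci[a] * cj[b]))
               for (a, b), c in cij.items())

def chow_liu_edges(data, d=4):
    """Step 3: maximum-weight spanning tree (Kruskal) on empirical MI weights."""
    weights = {e: empirical_mi(data, *e) for e in combinations(range(d), 2)}
    parent = list(range(d))

    def find(u):
        while parent[u] != u:
            u = parent[u]
        return u

    edges = set()
    for e in sorted(weights, key=weights.get, reverse=True):
        ru, rv = find(e[0]), find(e[1])
        if ru != rv:          # adding e does not create a cycle
            parent[ru] = rv
            edges.add(e)
    return edges

def simulated_rate(n, gamma, runs, seed=0):
    """Steps 4-5: estimate P(A_n) over `runs` trials and return (P_err, rate)."""
    rng = random.Random(seed)
    true_edges = {(0, 1), (0, 2), (0, 3)}
    errors = sum(chow_liu_edges(sample_star(n, gamma, rng)) != true_edges
                 for _ in range(runs))
    p_err = errors / runs
    rate = -math.log(p_err) / n if p_err > 0 else math.inf
    return p_err, rate

print(simulated_rate(n=200, gamma=0.05, runs=200))
```

When no error is observed in any run, the sketch returns an infinite rate, which signals that the number of runs is too small for that pair $(n, \gamma)$.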

If the probability of error $\mathbb{P}(\mathcal{A}_n)$ is very small, then the number
of runs $M$ to estimate $\mathbb{P}(\mathcal{A}_n)$ has to be fairly large. This
is often the case in error exponent analysis as the sample
size needs to be substantial to estimate very small error
probabilities.

In Fig. 10, we plot the true rate, the approximate rate and
the simulated rate when $\gamma = 0.01$ (and $M = 10^7$) and $\gamma = 0.2$
(and $M = 5 \times 10^8$). Note that, in the former case, the true
rate is higher than the approximate rate and in the latter case,
the reverse is true. When $\gamma$ is large ($\gamma = 0.2$), there are large
differences in the true tree models. Thus, we expect the
error probabilities to be very small and hence $M$ has to be
Fig. 10. Comparison of True, Approximate and Simulated Rates with $\gamma = 0.01$
(top) and $\gamma = 0.2$ (bottom). Here the number of runs $M = 10^7$ for
$\gamma = 0.01$ and $M = 5 \times 10^8$ for $\gamma = 0.2$. The probability of error is
computed by dividing the total number of errors by the total number of runs.



large in order to estimate the error probability correctly but $n$
does not have to be too large for the simulated rate to converge
to the true rate. On the other hand, when $\gamma$ is small ($\gamma = 0.01$),
there are only subtle differences in the graphical models, hence
we need a larger number of samples $n$ for the simulated rate
to converge to its true value, but $M$ does not have to be large
since the error probabilities are not small. The above observations
are in line with our intuition.

C. Comparison of True Crossover Rate to Rate obtained from 
the Empirical Distribution 

In this subsection, we compare the true rate to the empirical
rate, which is defined as

$$\widehat{J}_{e,e'} := \inf_{Q \in \mathcal{P}(\mathcal{X}^4)} \left\{ D(Q \,\|\, \widehat{P}_{e,e'}) : I(Q_{e'}) = I(Q_e) \right\}. \tag{63}$$



The empirical rate $\widehat{J}_{e,e'} = \widehat{J}_{e,e'}(\widehat{P}_{e,e'})$ is a function of the
empirical distribution $\widehat{P}_{e,e'}$. This rate is computable by a
learner who does not have access to the true distribution $P$.
The learner only has access to a finite number of samples
$x^n = \{x_1, \ldots, x_n\}$. Given $x^n$, the learner can compute
the empirical probability $\widehat{P}_{e,e'}$ and perform the optimization
in (63). This is an estimate of the true crossover rate. A natural
question to ask is the following: Does the empirical rate $\widehat{J}_{e,e'}$
converge to the true crossover rate $J_{e,e'}$ as $n \to \infty$? The next
theorem answers this question in the affirmative.

Theorem 12 (Crossover Rate Consistency): The empirical
crossover rate $\widehat{J}_{e,e'}$ in (63) converges almost surely to the true
Fig. 11. Comparison of True, Approximate and Empirical Rates with $\gamma = 0.01$
(top) and $\gamma = 0.2$ (bottom). Here $n$ is the number of observations used
to estimate the empirical distribution.



crossover rate $J_{e,e'}$ in (19), i.e.,

$$\mathbb{P}\left( \lim_{n \to \infty} \widehat{J}_{e,e'} = J_{e,e'} \right) = 1. \tag{64}$$



Proof: (Sketch) The proof of this theorem follows from
the continuity of $\widehat{J}_{e,e'}$ in the empirical distribution $\widehat{P}_{e,e'}$ and
the continuous mapping theorem by Mann and Wald [39]. See
Appendix F for the details. ■
We conclude that the learning of the rate from samples is 
consistent. Now we perform simulations to determine how 
many samples are required for the empirical rate to converge 
to the true rate. 

We set $\gamma = 0.01$ and $\gamma = 0.2$ in (62). We then drew $n$ i.i.d.
samples from $P$ and computed the empirical distribution $\widehat{P}_{e,e'}$.
Next, we solved the optimization problem in (63) using the
fmincon function in Matlab with different initializations
and compared the empirical rate to the true rate. We repeated
this for several values of $n$ and the results are displayed in
Fig. 11. We see that for $\gamma = 0.01$, approximately $n = 8 \times 10^6$
samples are required for the empirical distribution to be
close enough to the true distribution so that the empirical rate
converges to the true rate.

IX. Conclusion 

In this paper, we presented a solution to the problem 
of finding the error exponent for tree structure learning by 
extensively using tools from large-deviations theory combined 
with facts about tree graphs. We quantified the error exponent 
for learning the structure and exploited the structure of the 



true tree to identify the dominant tree in the set of erroneous 
trees. We also drew insights from the approximate crossover 
rate, which can be interpreted as the SNR for learning. These 
two main results in Theorems 5 and 8 provide the intuition as
to how errors occur for learning discrete tree distributions via 
the Chow-Liu algorithm. 

In a future paper, we develop counterparts to the results 
here for the Gaussian case. Many of the results carry through 
but thanks to the special structure that Gaussian distributions 
possess, we are also able to identify which structures are easier 
to learn and which are harder to learn given a fixed set of 
correlation coefficients. We are also interested in studying the
optimality of the error exponent associated with the ML Chow- 
Liu algorithm, i.e., whether the rate established is the best 
among all consistent estimators of the edge set. 
Appendix A
Proof of Theorem 2

Proof: We divide the proof of this theorem into three
steps. Steps 1 and 2 prove the expression in (19), and Step 3 proves
the existence of the optimizer.

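Step 1: First, we note from Sanov's Theorem [26, Ch. 11]
that the empirical joint distribution on edges $e$ and $e'$ satisfies

$$\lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}(\widehat{P}_{e,e'} \in B) = \min_{Q \in \mathcal{P}(\mathcal{X}^4)} \left\{ D(Q \,\|\, P_{e,e'}) : Q \in B \right\} \tag{65}$$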

for any set $B \subset \mathcal{P}(\mathcal{X}^4)$ that equals the closure of its interior,
i.e., $B = \mathrm{cl}(\mathrm{int}(B))$. We now have a LDP for the sequence
of probability measures $\widehat{P}_{e,e'}$, the empirical distribution on
$(e, e')$. Assuming that $e$ and $e'$ do not share a common node,
$\widehat{P}_{e,e'} \in \mathcal{P}(\mathcal{X}^4)$ is a probability distribution over four variables
(the variables in the node pairs $e$ and $e'$). We now define the
function $h : \mathcal{P}(\mathcal{X}^4) \to \mathbb{R}$ as

$$h(Q) := I(Q_{e'}) - I(Q_e). \tag{66}$$



Since $Q_e = \sum_{x_{e'}} Q$ is continuous in $Q$ and the mutual information
$I(Q_e)$ is also continuous in $Q_e$, we conclude that $h$ is
indeed continuous, since it is the composition of continuous
functions. By applying the contraction principle [10] to the
sequence of probability measures $\widehat{P}_{e,e'}$ and the continuous map
$h$, we obtain a corresponding LDP for the new sequence of
probability measures $h(\widehat{P}_{e,e'}) = I(\widehat{P}_{e'}) - I(\widehat{P}_e)$, where the
rate is given by:

$$\begin{aligned}
J_{e,e'} &= \inf_{Q \in \mathcal{P}(\mathcal{X}^4)} \left\{ D(Q \,\|\, P_{e,e'}) : h(Q) \ge 0 \right\} \\
&= \inf_{Q \in \mathcal{P}(\mathcal{X}^4)} \left\{ D(Q \,\|\, P_{e,e'}) : I(Q_{e'}) \ge I(Q_e) \right\}.
\end{aligned} \tag{67}$$



We now claim that the limit in (18) exists. From Sanov's
theorem [26, Ch. 11], it suffices to show that the constraint set
$B := \{I(Q_{e'}) \ge I(Q_e)\}$ in (67) is a regular closed set, i.e.,
it satisfies $B = \mathrm{cl}(\mathrm{int}(B))$. This is true because there are no
Fig. 12. Illustration of Step 2 of the proof of Theorem 2.

Fig. 13. Illustration for Case 1 of the proof of Theorem 5.



isolated points in $B$ and thus the interior is nonempty. Hence,
there exists a sequence of distributions $\{Q^{(m)}\}_{m=1}^{\infty} \subset \mathrm{int}(B)$
such that $\lim_{m \to \infty} D(Q^{(m)} \,\|\, P_{e,e'}) = D(Q^* \,\|\, P_{e,e'})$, which
proves the existence of the limit.

Step 2: We now show that the
optimal solution $Q^*_{e,e'}$, if it exists (as will be shown in Step
3), must satisfy $I(Q^*_e) = I(Q^*_{e'})$. Suppose, to the contrary,
that $Q^*_{e,e'}$ with objective value $D(Q^*_{e,e'} \,\|\, P_{e,e'})$ is such that
$I(Q^*_{e'}) > I(Q^*_e)$. Then $h(Q^*_{e,e'}) > 0$, where $h$, as shown
above, is continuous. Thus, there exists a $\delta > 0$ such that the
$\delta$-neighborhood

$$N_\delta(Q^*_{e,e'}) := \{ R : \| R - Q^*_{e,e'} \|_\infty < \delta \}$$

satisfies $h(N_\delta(Q^*_{e,e'})) \subset (0, \infty)$ [33, Ch. 2]. Consider the new
distribution (see Fig. 12)

$$Q^{**}_{e,e'} = Q^*_{e,e'} + \frac{\delta}{2} \left( P_{e,e'} - Q^*_{e,e'} \right) = \left( 1 - \frac{\delta}{2} \right) Q^*_{e,e'} + \frac{\delta}{2} P_{e,e'}.$$

Note that $Q^{**}_{e,e'}$ belongs to $N_\delta(Q^*_{e,e'})$ and hence is a feasible
solution of (67). We now prove that $D(Q^{**}_{e,e'} \,\|\, P_{e,e'}) <
D(Q^*_{e,e'} \,\|\, P_{e,e'})$, which contradicts the optimality of $Q^*_{e,e'}$.

$$\begin{aligned}
D(Q^{**}_{e,e'} \,\|\, P_{e,e'}) &\le \left( 1 - \frac{\delta}{2} \right) D(Q^*_{e,e'} \,\|\, P_{e,e'}) + \frac{\delta}{2} D(P_{e,e'} \,\|\, P_{e,e'}) && (68) \\
&= \left( 1 - \frac{\delta}{2} \right) D(Q^*_{e,e'} \,\|\, P_{e,e'}) && (69) \\
&< D(Q^*_{e,e'} \,\|\, P_{e,e'}), && (70)
\end{aligned}$$

where (68) is due to the convexity of the KL-divergence in the
first variable [26, Ch. 2], (69) is because $D(P_{e,e'} \,\|\, P_{e,e'}) = 0$
and (70) is because $\delta > 0$. Thus, we conclude that the optimal
solution must satisfy $I(Q^*_e) = I(Q^*_{e'})$ and the crossover rate
can be stated as (19).
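The convexity argument in Step 2 can be checked numerically on a toy example. The sketch below is only illustrative: it uses two arbitrary distributions on a four-point alphabet (not the sixteen-point alphabet $\mathcal{X}^4$ of the proof), and the helper names `kl` and `mix` are hypothetical. It verifies that moving from $Q$ toward $P$ by a fraction $\delta/2$ shrinks the divergence by at least the factor $(1 - \delta/2)$.

```python
import math

def kl(q, p):
    """KL divergence D(q || p) in nats for pmfs given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def mix(q, p, t):
    """Convex combination (1 - t) q + t p."""
    return [(1 - t) * qi + t * pi for qi, pi in zip(q, p)]

P = [0.4, 0.3, 0.2, 0.1]   # plays the role of P_{e,e'}
Q = [0.1, 0.2, 0.3, 0.4]   # plays the role of Q*_{e,e'}

for delta in (0.1, 0.5, 1.0):
    Q2 = mix(Q, P, delta / 2)              # plays the role of Q**_{e,e'}
    lhs = kl(Q2, P)
    bound = (1 - delta / 2) * kl(Q, P)     # right-hand side of the convexity bound
    print(delta, lhs, bound, kl(Q, P))
```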

Step 3: Now, we prove the existence of the minimizer $Q^*_{e,e'}$,
which will allow us to replace the inf in (19) with min. First,
we note that $D(Q \,\|\, P_{e,e'})$ is continuous in both variables and
hence continuous in the first variable $Q$. It remains to show
that the constraint set

$$\Lambda := \{ Q \in \mathcal{P}(\mathcal{X}^4) : I(Q_{e'}) = I(Q_e) \} \tag{71}$$

is compact, since it is clearly nonempty (the uniform distribution
belongs to $\Lambda$). Then we can conclude, by Weierstrass'
extreme value theorem [33, Theorem 4.16], that the minimizer
$Q^* \in \Lambda$ exists. By the Heine-Borel theorem [33, Theorem
2.41], it suffices to show that $\Lambda$ is bounded and closed.
Clearly $\Lambda$ is bounded since $\mathcal{P}(\mathcal{X}^4)$ is a bounded set. Now,
$\Lambda = h^{-1}(\{0\})$ where $h$ is defined in (66). Since $h$ is
continuous and $\{0\}$ is closed (in the usual topology of the
real line), $\Lambda$ is closed [33, Theorem 4.8]. Hence $\Lambda$ is
compact. We also need to use the fact that $\Lambda$ is compact in
the proof of Theorem 12. ■



Appendix B
Proof of Theorem 5

Proof: We first claim that $\mathcal{E}_P^*$, the edge set corresponding
to the dominant error tree, differs from $\mathcal{E}_P$ by exactly
one edge.$^{18}$ To prove this claim, assume, to the contrary,
that $\mathcal{E}_P^*$ differs from $\mathcal{E}_P$ by two edges. Let $\mathcal{E}_{\mathrm{ML}} = \mathcal{E}' :=
\mathcal{E}_P \setminus \{e_1, e_2\} \cup \{e_1', e_2'\}$, where $e_1', e_2' \notin \mathcal{E}_P$ are the two
edges that have replaced $e_1, e_2 \in \mathcal{E}_P$ respectively. Since
$T' = (V, \mathcal{E}')$ is a tree, these edges cannot be arbitrary and
specifically, $\{e_1, e_2\} \in \{\mathrm{Path}(e_1'; \mathcal{E}_P) \cup \mathrm{Path}(e_2'; \mathcal{E}_P)\}$ for the
tree constraint to be satisfied. Recall that the rate of the event
that the output of the ML algorithm is $T'$ is given by $\Upsilon(T')$
in (27). Then consider the probability of the joint event (with
respect to the probability measure $\mathbb{P} = P^n$).

Case 1: $e_i \in \mathrm{Path}(e_i'; \mathcal{E}_P)$ for $i = 1,2$ and $e_i \notin
\mathrm{Path}(e_j'; \mathcal{E}_P)$ for $i,j = 1,2$ and $i \ne j$. See Fig. 13. Note that
the true mutual information quantities satisfy $I(P_{e_i}) > I(P_{e_i'})$.
We prove this claim by contradiction: suppose that $I(P_{e_i'}) \ge
I(P_{e_i})$. Then $\mathcal{E}_P$ does not have maximum weight because if
the non-edge $e_i'$ replaces the true edge $e_i$, the resulting tree$^{19}$
would have weight at least as large, contradicting the optimality of the
true edge set $\mathcal{E}_P$, which is the MWST with the true mutual
information quantities as edge weights. More precisely, we can
compute the exponent when $T'$ is the output of the MWST
algorithm:

$$\begin{aligned}
\Upsilon(T') &= \lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}\Big( \bigcap_{i=1,2} \{ I(\widehat{P}_{e_i'}) \ge I(\widehat{P}_{e_i}) \} \Big) \\
&\ge \max_{i=1,2} \lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}\big( \{ I(\widehat{P}_{e_i'}) \ge I(\widehat{P}_{e_i}) \} \big) \\
&= \max\{ J_{e_1,e_1'}, J_{e_2,e_2'} \}.
\end{aligned} \tag{72}$$

Now $J_{e_i,e_i'} = \Upsilon(T_i)$ where $T_i := (V, \mathcal{E}_P \setminus \{e_i\} \cup \{e_i'\})$. From
Prop. 2] the error exponent is the rate associated with the dominant error
tree, i.e., $K_P = \min_{T \ne T_P} \Upsilon(T)$, and from (72), the dominant
error tree cannot be $T'$ and should differ from $T_P$ by one and
only one edge.

18 This is somewhat analogous to the fact that the second-best MWST differs
from the MWST by exactly one edge [30].

19 The resulting graph is indeed a tree because $\{e_i'\} \cup \mathrm{Path}(e_i'; \mathcal{E}_P)$ forms
a cycle, so if any edge is removed, the resulting structure does not have any
cycles and is connected; hence it is a tree. See Fig. 2.

Fig. 14. Illustration for Case 2 of the proof of Theorem 5.

Case 2: $e_i \in \mathrm{Path}(e_j'; \mathcal{E}_P)$ for $i,j = 1,2$. See Fig. 14.
Now, we have $I(P_{e_i}) > I(P_{e_j'})$ for $i,j = 1,2$ and

$$\Upsilon(T') = \lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}\Big( \Big( \bigcap_{i=1,2} \{ I(\widehat{P}_{e_i'}) \ge I(\widehat{P}_{e_i}) \} \Big) \cup \Big( \bigcap_{i,j=1,2,\, j \ne i} \{ I(\widehat{P}_{e_j'}) \ge I(\widehat{P}_{e_i}) \} \Big) \Big). \tag{73}$$

Hence, we have

$$\Upsilon(T') \ge \min\Big\{ \max_{i=1,2} J_{e_i,e_i'}, \; \max_{i,j=1,2,\, j \ne i} J_{e_i,e_j'} \Big\}. \tag{74}$$

Again, $T'$ cannot be the dominant error tree.

Similar results can be shown for the case when $e_i \in
\mathrm{Path}(e_i'; \mathcal{E}_P)$ for $i = 1,2$, $e_2 \in \mathrm{Path}(e_1'; \mathcal{E}_P)$ and $e_1 \notin
\mathrm{Path}(e_2'; \mathcal{E}_P)$, and the case when $e_i \in \mathrm{Path}(e_i'; \mathcal{E}_P)$ for
$i = 1,2$, $e_1 \in \mathrm{Path}(e_2'; \mathcal{E}_P)$ and $e_2 \notin \mathrm{Path}(e_1'; \mathcal{E}_P)$.

We now use the "worst-exponent-wins principle" [10, Ch.
1] to conclude that the rate that dominates is the minimum
$J_{r(e'),e'}$ over all possible $e' \notin \mathcal{E}_P$, namely $J_{r(e^*),e^*}$ with $e^*$
defined in (34). More precisely,

$$\begin{aligned}
\mathbb{P}(\mathcal{A}_n) &= \mathbb{P}\Big( \bigcup_{e' \notin \mathcal{E}_P} \{ e' \text{ replaces any } e \in \mathrm{Path}(e'; \mathcal{E}_P) \text{ in } T_{\mathrm{ML}} \} \Big) \\
&= \mathbb{P}\Big( \bigcup_{e' \notin \mathcal{E}_P} \; \bigcup_{e \in \mathrm{Path}(e'; \mathcal{E}_P)} \{ e' \text{ replaces } e \text{ in } T_{\mathrm{ML}} \} \Big) \\
&\le \sum_{e' \notin \mathcal{E}_P} \; \sum_{e \in \mathrm{Path}(e'; \mathcal{E}_P)} \mathbb{P}(\{ e' \text{ replaces } e \text{ in } T_{\mathrm{ML}} \}) && (75) \\
&= \sum_{e' \notin \mathcal{E}_P} \; \sum_{e \in \mathrm{Path}(e'; \mathcal{E}_P)} \mathbb{P}(\{ I(\widehat{P}_{e'}) \ge I(\widehat{P}_e) \}) && (76) \\
&\doteq \sum_{e' \notin \mathcal{E}_P} \; \sum_{e \in \mathrm{Path}(e'; \mathcal{E}_P)} \exp(-n J_{e,e'}) && (77) \\
&\doteq \exp\Big( -n \min_{e' \notin \mathcal{E}_P} \; \min_{e \in \mathrm{Path}(e'; \mathcal{E}_P)} J_{e,e'} \Big), && (78)
\end{aligned}$$

where (75) is from the union bound, (76) and (77) are from
the definitions of the crossover event and rate respectively (as
described in Cases 1 and 2 above) and (78) is an application
of the "worst-exponent-wins" principle [10, Ch. 1].
We conclude from (78) that

$$\mathbb{P}(\mathcal{A}_n) \,\dot{\le}\, \exp(-n J_{r(e^*),e^*}), \tag{79}$$

from the definition of the dominant replacement edge $r(e')$
and the dominant non-edge $e^*$, defined in (32) and (34)
respectively. The lower bound follows trivially from the fact
that if $e^* \notin \mathcal{E}_P$ replaces $r(e^*)$, then the error $\mathcal{A}_n$ occurs. Thus,
$\{e^* \text{ replaces } r(e^*)\} \subset \mathcal{A}_n$ and

$$\mathbb{P}(\mathcal{A}_n) \ge \mathbb{P}(\{ e^* \text{ replaces } r(e^*) \text{ in } T_{\mathrm{ML}} \}) \doteq \exp(-n J_{r(e^*),e^*}).$$

Hence, $\mathbb{P}(\mathcal{A}_n) \doteq \exp(-n J_{r(e^*),e^*})$, which proves our main
result in (33). ■

Appendix C
Proof of Theorem 6

Statement (a) $\Leftrightarrow$ statement (b) was proven in full after the
theorem was stated. Here we provide the proof that (b) $\Leftrightarrow$ (c).
Recall that statement (c) says that $T_P$ is not a proper forest.
We first begin with a preliminary lemma.

Lemma 13: Suppose $x, y, z$ are three random variables taking
on values on a finite set $\mathcal{X}$. Assume that $P(x, y, z) > 0$
everywhere. Then $x - y - z$ and $x - z - y$ are Markov chains
if and only if $x$ is jointly independent of $(y, z)$.

Proof: ($\Rightarrow$) That $x - y - z$ is a Markov chain implies that

$$P(z \mid y, x) = P(z \mid y),$$

or alternatively

$$P(x, y, z) = P(x, y) \frac{P(y, z)}{P(y)}. \tag{80}$$

Similarly, from the fact that $x - z - y$ is a Markov chain, we
have

$$P(x, y, z) = P(x, z) \frac{P(y, z)}{P(z)}. \tag{81}$$

Equating (80) and (81), and using the positivity to cancel
$P(y, z)$, we arrive at

$$P(x \mid y) = P(x \mid z).$$

It follows that $P(x \mid y)$ does not depend on $y$, so there is some
constant $C(x)$ such that $P(x \mid y) = C(x)$ for all $y \in \mathcal{X}$. This
immediately implies that $C(x) = P(x)$ so that $P(x \mid y) =
P(x)$. A similar argument gives that $P(x \mid z) = P(x)$. Furthermore,
if $x - y - z$ is a Markov chain, so is $z - y - x$,
therefore

$$P(x \mid y, z) = P(x \mid y) = P(x). \tag{82}$$

The above equation says that $x$ is jointly independent of both
$y$ and $z$.

($\Leftarrow$) Conversely, if $x$ is jointly independent of both $y$ and $z$,
then $x - y - z$ and $x - z - y$ are Markov chains. ■
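Lemma 13 can be sanity-checked numerically through conditional mutual information, since $x - y - z$ is a Markov chain exactly when $I(x; z \mid y) = 0$. The sketch below is an illustration only: the two example distributions and the helper names `cond_mi`, `independent_joint` and `chain_joint` are hypothetical choices, not objects from the paper.

```python
import math
from itertools import product

def cond_mi(joint, a, b, c):
    """Conditional mutual information I(X_a; X_b | X_c) in nats for a pmf on triples."""
    def marg(idx):
        m = {}
        for x, p in joint.items():
            key = tuple(x[k] for k in idx)
            m[key] = m.get(key, 0.0) + p
        return m

    pabc, pac, pbc, pc = marg((a, b, c)), marg((a, c)), marg((b, c)), marg((c,))
    return sum(p * math.log(p * pc[(k[2],)] / (pac[(k[0], k[2])] * pbc[(k[1], k[2])]))
               for k, p in pabc.items() if p > 0)

def independent_joint():
    """x independent of (y, z); (y, z) are dependent on each other."""
    px = {0: 0.3, 1: 0.7}
    pyz = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
    return {(x, y, z): px[x] * pyz[(y, z)] for x, y, z in product((0, 1), repeat=3)}

def chain_joint():
    """x -> y -> z with noisy copies, so x - y - z holds but x depends on y."""
    px = {0: 0.3, 1: 0.7}
    flip = lambda a, b, eps: 1 - eps if a == b else eps
    return {(x, y, z): px[x] * flip(y, x, 0.2) * flip(z, y, 0.1)
            for x, y, z in product((0, 1), repeat=3)}

for name, joint in (("independent", independent_joint()), ("chain", chain_joint())):
    # Indices: x = 0, y = 1, z = 2.
    print(name, "I(x;z|y) =", cond_mi(joint, 0, 2, 1), "I(x;y|z) =", cond_mi(joint, 0, 1, 2))
```

In the first example both conditional mutual informations vanish, so both Markov chains hold; in the second only $x - y - z$ holds, consistent with $x$ not being independent of $(y, z)$.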

Proof: We now prove (b) $\Leftrightarrow$ (c) using Lemma 13 and
the assumption that $P(x) > 0$ for all $x \in \mathcal{X}^d$.
($\Rightarrow$) If (b) is true then $I(P_{e'}) \ne I(P_e)$ for all $e \in \mathrm{Path}(e'; \mathcal{E}_P)$
and for all $e' \notin \mathcal{E}_P$. Assume, to the contrary, that $T_P$ is a
proper forest, i.e., it contains at least 2 connected components
(each connected component may only have one node), say
$G_i = (V_i, \mathcal{E}_i)$ for $i = 1,2$. Without loss of generality, let
$x_1$ be in component $G_1$ and $x_2, x_3$ belong to component $G_2$.
Then since $V_1 \cap V_2 = \emptyset$ and $V_1 \cup V_2 = V$, we have that $x_1$ is
jointly independent of $x_2$ and $x_3$. By Lemma 13, we have
the following Markov chains: $x_1 - x_2 - x_3$ and $x_1 - x_3 - x_2$.
This implies from the Data Processing Inequality [26, Theorem
2.8.1] that $I(P_{1,2}) \ge I(P_{1,3})$ and at the same time $I(P_{1,2}) \le
I(P_{1,3})$, which means that $I(P_{1,2}) = I(P_{1,3})$. This contradicts
(b) since by taking $e' = (1,2)$, the mutual informations along
the path $\mathrm{Path}(e'; \mathcal{E}_P)$ are no longer distinct.
($\Leftarrow$) Now assume that (c) is true, i.e., $T_P$ is not a proper forest.
Suppose, to the contrary, (b) is not true, i.e., there exists an
$e' \notin \mathcal{E}_P$ such that $I(P_{e'}) = I(P_{r(e')})$, where $r(e')$ is the
replacement edge associated with the non-edge $e'$. Without
loss of generality, let $e' = (1,2)$ and $r(e') = (3,4)$. Then,
since $T_P$ is not a proper forest, we have the following Markov
chain: $x_1 - x_3 - x_4 - x_2$. Now note that $I(P_{1,2}) = I(P_{3,4})$. In
fact, because there is no loss of mutual information, $I(P_{1,4}) =
I(P_{3,4})$ and hence by the Data Processing Inequality we also
have $x_3 - x_1 - x_4 - x_2$. By using Lemma 13, we have $x_4$
jointly independent of $x_1$ and $x_3$, hence we have a proper
forest, which is a contradiction. ■

Appendix D
Proof of Theorem 8

Proof: The proof proceeds in several steps. See Figs. @]
and [3] for intuition behind this proof.

Step 1: Let $Q$ be such that

$$Q(x_i, x_j, x_k, x_l) = P_{e,e'}(x_i, x_j, x_k, x_l) + \epsilon_{i,j,k,l}. \tag{83}$$

Thus, the $\epsilon_{i,j,k,l}$'s are the deviations of $Q$ from $P_{e,e'}$. To ensure
that $Q$ is a valid distribution we require $\sum \epsilon_{i,j,k,l} = 0$. The
objective in (44) can now be alternatively expressed as

$$\frac{1}{2} \boldsymbol{\epsilon}^T K_{e,e'} \boldsymbol{\epsilon} = \frac{1}{2} \sum_{x_i,x_j,x_k,x_l} \frac{\epsilon_{i,j,k,l}^2}{P_{e,e'}(x_i, x_j, x_k, x_l)}, \tag{84}$$

where $\boldsymbol{\epsilon}$ is the vectorized version of the deviations
$\epsilon_{i,j,k,l}$ and $K_{e,e'}$ is a $|\mathcal{X}|^4 \times |\mathcal{X}|^4$ diagonal matrix containing
the entries $1/P_{e,e'}(x_i, x_j, x_k, x_l)$ along its diagonal.

Step 2: We now perform a first-order Taylor expansion of
$I(Q_e)$ in the neighborhood of $I(P_e)$:

$$\begin{aligned}
I(Q_e) &= I(P_e) + \boldsymbol{\epsilon}^T \nabla_{\boldsymbol{\epsilon}} I(Q_e) \Big|_{\boldsymbol{\epsilon}=0} + o(\|\boldsymbol{\epsilon}\|) && (85) \\
&= I(P_e) + \boldsymbol{\epsilon}^T \mathbf{s}_e + o(\|\boldsymbol{\epsilon}\|), && (86)
\end{aligned}$$

where $\mathbf{s}_e$ is the length-$|\mathcal{X}|^4$ vector that contains the information
density values of edge $e$. Note that because of the
assumption that $T_P$ is not a proper forest, $P_{i,j} \ne P_i P_j$ for all
$(i,j)$, hence the linear term does not vanish.$^{20}$ The constraints
can now be rewritten as

$$\boldsymbol{\epsilon}^T \mathbf{1} = 0, \qquad \boldsymbol{\epsilon}^T (\mathbf{s}_{e'} - \mathbf{s}_e) = I(P_e) - I(P_{e'}), \tag{87}$$

or in matrix notation as:

$$\begin{bmatrix} \mathbf{s}_{e'}^T - \mathbf{s}_e^T \\ \mathbf{1}^T \end{bmatrix} \boldsymbol{\epsilon} = \begin{bmatrix} I(P_e) - I(P_{e'}) \\ 0 \end{bmatrix}, \tag{88}$$

where $\mathbf{1}$ is the length-$|\mathcal{X}|^4$ vector consisting of all ones. For
convenience, we define $L_{e,e'}$ to be the matrix in (88), i.e.,

$$L_{e,e'} := \begin{bmatrix} \mathbf{s}_{e'}^T - \mathbf{s}_e^T \\ \mathbf{1}^T \end{bmatrix}. \tag{89}$$

20 Indeed if $P_e$ were a product distribution, the linear term in (86) vanishes
and $I(Q_e)$ is approximately a quadratic in $\boldsymbol{\epsilon}$ (as shown in [13]).

Step 3: The optimization problem now reduces to minimizing
(84) subject to the constraints in (88). This is a standard
least-squares problem. By using the Projection Theorem in
Hilbert spaces, we get the solution

$$\boldsymbol{\epsilon}^* = K_{e,e'}^{-1} L_{e,e'}^T \left( L_{e,e'} K_{e,e'}^{-1} L_{e,e'}^T \right)^{-1} \begin{bmatrix} I(P_e) - I(P_{e'}) \\ 0 \end{bmatrix}. \tag{90}$$

The inverse of $L_{e,e'} K_{e,e'}^{-1} L_{e,e'}^T$ exists because we assumed $T_P$
is not a proper forest and hence $P_{i,j} \ne P_i P_j$ for all $(i,j) \in \binom{V}{2}$.
This is a sufficient condition for the matrix $L_{e,e'}$ to have
full row rank and thus, $L_{e,e'} K_{e,e'}^{-1} L_{e,e'}^T$ is invertible. Finally,
we substitute $\boldsymbol{\epsilon}^*$ in (90) into (84) to obtain

$$\widetilde{J}_{e,e'} = \frac{1}{2} \left[ \left( L_{e,e'} K_{e,e'}^{-1} L_{e,e'}^T \right)^{-1} \right]_{11} \left( I(P_e) - I(P_{e'}) \right)^2, \tag{91}$$

where $[M]_{11}$ is the (1,1) element of the matrix $M$. Define $\psi$
to be the weighting function given by

$$\psi(P_{e,e'}) := \left[ \left( L_{e,e'} K_{e,e'}^{-1} L_{e,e'}^T \right)^{-1} \right]_{11}. \tag{92}$$

It now suffices to show that $\psi(P_{e,e'})$ is indeed the inverse
variance of $s_e - s_{e'}$. We now simplify the expression for the
weighting function $\psi(P_{e,e'})$, recalling how $L_{e,e'}$ and $K_{e,e'}$ are
defined. The product of the matrices in (92) is

$$L_{e,e'} K_{e,e'}^{-1} L_{e,e'}^T = \begin{bmatrix} \mathbb{E}[(s_{e'} - s_e)^2] & \mathbb{E}[s_{e'} - s_e] \\ \mathbb{E}[s_{e'} - s_e] & 1 \end{bmatrix}, \tag{93}$$

where all expectations are with respect to the distribution $P_{e,e'}$.
Note that the determinant of (93) is $\mathbb{E}[(s_{e'} - s_e)^2] - \mathbb{E}[(s_{e'} -
s_e)]^2 = \mathrm{Var}(s_{e'} - s_e)$. Hence, the (1,1) element of the inverse
of (93) is simply

$$\psi(P_{e,e'}) = \mathrm{Var}(s_{e'} - s_e)^{-1}.$$

Now, if $e$ and $e'$ share a node, this proof proceeds in exactly
the same way. In particular, the crucial step (86) will also
remain the same since the Taylor expansion does not change.
This concludes the first part of the proof.

Step 4: Recall that we use the notation $\alpha_1 \approx_\delta \alpha_2$ to denote
that $|\alpha_1 - \alpha_2| < \delta$. Now, assuming that $P_{e,e'}$ satisfies the $\epsilon$-very
noisy condition, then the following continuity statements
hold:

$$\begin{aligned}
&\exists\, \delta_1 > 0 \ \text{s.t.} \ I(P_e) \approx_{\delta_1} I(P_{e'}), && (94) \\
&\exists\, \delta_2 > 0 \ \text{s.t.} \ \| Q^*_{e,e'} - P_{e,e'} \|_\infty < \delta_2, && (95) \\
&\exists\, \delta_3 > 0 \ \text{s.t.} \ D(Q^*_{e,e'} \,\|\, P_{e,e'}) \approx_{\delta_3} \tfrac{1}{2} \| Q^*_{e,e'} - P_{e,e'} \|^2_{P_{e,e'}}, && (96) \\
&\exists\, \delta_4 > 0 \ \text{s.t.} \ I(P_e) \approx_{\delta_4} \mathbf{s}_e^T (Q^*_{e,e'} - P_{e,e'}), && (97) \\
&\exists\, \delta > 0 \ \text{s.t.} \ \widetilde{J}_{e,e'} \approx_\delta J_{e,e'}, && (98)
\end{aligned}$$

where (94) follows from the continuity of mutual information,
(95) follows because $P_{e,e'}$ is $\delta_2$-close to the constraint set
$\{Q : I(Q_{e'}) \ge I(Q_e)\}$ and hence $\delta_2$-close to the optimal
solution $Q^*_{e,e'}$, (96) follows from the approximation of the
KL-divergence, and (97) follows from the Taylor expansion of the
mutual information. Finally, (98) follows from continuity of
the objective in (96) and the constraints (97). Eqn. (98) says
that $J_{e,e'}$ depends continuously on $\epsilon$ (in the definition of the
$\epsilon$-very noisy condition), i.e., $\widetilde{J}_{e,e'} \to J_{e,e'}$ as $\epsilon \to 0$. This
completes the proof. ■
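The closed form obtained in Step 3, namely an approximate rate equal to $(I(P_e) - I(P_{e'}))^2 / (2\,\mathrm{Var}(s_{e'} - s_e))$, is straightforward to evaluate. The sketch below is a hedged illustration rather than the authors' code: it assumes the four-node binary star with $P(x_1 = 0) = 1/3$ and $P(x_i = x_1 \mid x_1) = 1/2 + \gamma$, takes expectations under the full joint distribution, and uses hypothetical helper names (`star_joint`, `marginal`, `info_density`, `approximate_rate`).

```python
import math
from itertools import product

def star_joint(gamma):
    """Joint pmf of (x1, x2, x3, x4) for the assumed binary star with hub x1."""
    p1 = {0: 1.0 / 3.0, 1: 2.0 / 3.0}
    cond = lambda xi, x1: 0.5 + gamma if xi == x1 else 0.5 - gamma  # assumed form
    return {x: p1[x[0]] * cond(x[1], x[0]) * cond(x[2], x[0]) * cond(x[3], x[0])
            for x in product((0, 1), repeat=4)}

def marginal(joint, idx):
    """Marginal pmf on the coordinates listed in idx (keys are tuples)."""
    m = {}
    for x, p in joint.items():
        key = tuple(x[k] for k in idx)
        m[key] = m.get(key, 0.0) + p
    return m

def info_density(joint, e):
    """s_e(x) = log P_e(x_i, x_j) / (P_i(x_i) P_j(x_j)) for every joint state x."""
    pe, pi, pj = marginal(joint, e), marginal(joint, e[:1]), marginal(joint, e[1:])
    return {x: math.log(pe[(x[e[0]], x[e[1]])] / (pi[(x[e[0]],)] * pj[(x[e[1]],)]))
            for x in joint}

def approximate_rate(joint, e, e_prime):
    """Return (approximate rate, E[s_e' - s_e], Var(s_e' - s_e))."""
    s_e, s_ep = info_density(joint, e), info_density(joint, e_prime)
    diff = {x: s_ep[x] - s_e[x] for x in joint}
    m1 = sum(p * diff[x] for x, p in joint.items())        # equals I(P_e') - I(P_e)
    m2 = sum(p * diff[x] ** 2 for x, p in joint.items())
    # L K^{-1} L^T = [[m2, m1], [m1, 1]]; the (1,1) entry of its inverse is 1 / det.
    det = m2 - m1 * m1                                      # the variance
    psi = 1.0 / det
    return 0.5 * psi * m1 ** 2, m1, det

for gamma in (0.05, 0.1, 0.15, 0.2):
    print(gamma, approximate_rate(star_joint(gamma), (0, 1), (1, 2)))
```

The pair used here, edge $(1,2)$ and non-edge $(2,3)$ in the paper's one-based labels, shares a node; the proof notes that the derivation is unchanged in that case.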
Appendix E
Proof of Proposition 10

Proof: The following facts about P in Table Q] can be 
readily verified: 

1) $P$ is positive everywhere, i.e., $P(x) > 0$ for all $x \in \mathcal{X}^3$.

2) $P$ is Markov on the complete graph with $d = 3$ nodes,
hence $P$ is not a tree distribution.

3) The mutual information between $x_1$ and $x_2$ as a function
of $\kappa$ is given by

$$I(P_{1,2}) = \log 2 + (1 - 2\kappa) \log(1 - 2\kappa) + 2\kappa \log(2\kappa).$$

Thus $I(P_{1,2}) \to \log 2 = 0.693$ as $\kappa \to 0$.

4) For any $(\xi, \kappa) \in (0, 1/3) \times (0, 1/2)$, $I(P_{2,3}) = I(P_{1,3})$
and this pair of mutual information quantities can be
made arbitrarily small as $\kappa \to 0$.

Thus, for sufficiently small $\kappa$, $I(P_{1,2}) > I(P_{2,3}) = I(P_{1,3})$. We
conclude that the Chow-Liu MWST algorithm will first pick
the edge (1,2) and then arbitrarily choose between the two
remaining edges: (2,3) or (1,3). ■
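The closed form in fact 3) can be evaluated directly. The sketch below also cross-checks it against a pairwise distribution that reproduces the same expression, namely uniform marginals with $P(x_1 \ne x_2) = 2\kappa$; this pairwise form is an assumption made here for illustration, since the table defining $P$ is not reproduced in this appendix, and the function names are hypothetical.

```python
import math

def mi_closed_form(kappa):
    """I(P_{1,2}) as stated in fact 3), in nats, for 0 < kappa < 1/2."""
    return (math.log(2) + (1 - 2 * kappa) * math.log(1 - 2 * kappa)
            + 2 * kappa * math.log(2 * kappa))

def assumed_pair_joint(kappa):
    """Assumed pairwise pmf: uniform marginals, off-diagonal mass kappa each."""
    return [[0.5 - kappa, kappa], [kappa, 0.5 - kappa]]

def mi_from_joint(joint):
    """Mutual information (nats) of a 2x2 joint pmf given as nested lists."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(joint[a][b] * math.log(joint[a][b] / (px[a] * py[b]))
               for a in range(2) for b in range(2) if joint[a][b] > 0)

for kappa in (0.25, 0.1, 0.01, 1e-6):
    print(kappa, mi_closed_form(kappa), mi_from_joint(assumed_pair_joint(kappa)))
```

As $\kappa$ shrinks, the printed values approach $\log 2 \approx 0.693$, in line with the limit stated above.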

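Appendix F
Proof of Theorem 12

We first state and prove two preliminary lemmas. Theorem 12
will then be an immediate consequence of these
lemmas.

Lemma 14: Let $\mathcal{X}$ and $\mathcal{Y}$ be two metric spaces and let $\mathcal{K} \subset
\mathcal{X}$ be a compact set in $\mathcal{X}$. Let $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ be a continuous
real-valued function. Then the function $g : \mathcal{Y} \to \mathbb{R}$, defined as

$$g(y) := \min_{x \in \mathcal{K}} f(x, y), \quad \forall\, y \in \mathcal{Y}, \tag{99}$$

is continuous on $\mathcal{Y}$.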

Proof: Set the minimizer in (99) to be

$$x(y) := \operatorname*{argmin}_{x \in \mathcal{K}} f(x, y). \tag{100}$$

The optimizer $x(y) \in \mathcal{K}$ exists since $f(x, y)$ is continuous
on $\mathcal{K}$ for each $y \in \mathcal{Y}$ and $\mathcal{K}$ is compact. This follows from
Weierstrass' extreme value theorem [33, Theorem 4.16]. We
want to show that $\lim_{y' \to y} g(y') = g(y)$. In other words,
we need to prove that

$$\lim_{y' \to y} f(x(y'), y') = f(x(y), y).$$

Consider the difference,

$$|f(x(y'), y') - f(x(y), y)| \le |f(x(y), y) - f(x(y), y')| + |f(x(y), y') - f(x(y'), y')|. \tag{101}$$

The first term in (101) tends to zero as $y' \to y$ by the
continuity of $f$ so it remains to show that the second term,
$B := |f(x(y), y') - f(x(y'), y')| \to 0$, as $y' \to y$. Now, we
can remove the absolute value since by the optimality of $x(y')$,
$f(x(y), y') \ge f(x(y'), y')$. Hence,

$$B = f(x(y), y') - f(x(y'), y').$$

Suppose, to the contrary, there exists a sequence $\{y'_n\}_{n=1}^{\infty} \subset \mathcal{Y}$
with $y'_n \to y$ such that

$$f(x(y), y'_n) - f(x(y'_n), y'_n) > \epsilon > 0, \quad \forall\, n \in \mathbb{N}. \tag{102}$$

By the compactness of $\mathcal{K}$, for the sequence $\{x(y'_n)\}_{n=1}^{\infty} \subset \mathcal{K}$,
there exists a subsequence $\{x(y'_{n_k})\}_{k=1}^{\infty} \subset \mathcal{K}$ whose limit is
$x^* = \lim_{k \to \infty} x(y'_{n_k})$ and $x^* \in \mathcal{K}$ [33, Theorem 3.6(a)]. By
the continuity of $f$,

$$\lim_{k \to \infty} f(x(y), y'_{n_k}) = f(x(y), y), \tag{103}$$

$$\lim_{k \to \infty} f(x(y'_{n_k}), y'_{n_k}) = f(x^*, y), \tag{104}$$

since every subsequence of a convergent sequence $\{y'_n\}$ converges
to the same limit $y$. Now (102) can be written as

$$f(x(y), y'_{n_k}) - f(x(y'_{n_k}), y'_{n_k}) > \epsilon > 0, \quad \forall\, k \in \mathbb{N}. \tag{105}$$

We now take the limit as $k \to \infty$ of (105). Next, we use (103)
and (104) to conclude that

$$f(x(y), y) - f(x^*, y) \ge \epsilon \ \Rightarrow \ f(x(y), y) \ge f(x^*, y) + \epsilon, \tag{106}$$

which contradicts the optimality of $x(y)$ in (100). Thus $B \to 0$
and $\lim_{y' \to y} g(y') = g(y)$, which demonstrates the continuity
of $g$ on $\mathcal{Y}$. ■

Lemma 15 (The continuous mapping theorem [39]): Let
$(\Omega, \mathcal{B}(\Omega), \nu)$ be a probability space, where $\Omega$ is a set, $\mathcal{B}(\Omega)$ is
the Borel $\sigma$-field over $\Omega$ and $\nu$ is a measure. Let the sequence
of random variables $\{X_n\}_{n=1}^{\infty}$ on $\Omega$ converge almost surely
to $X$, i.e., $X_n \xrightarrow{\text{a.s.}} X$. Let $g : \Omega \to \mathbb{R}$ be a continuous
function. Then $g(X_n) \xrightarrow{\text{a.s.}} g(X)$.

Proof: Now, using Lemmas 14 and 15, we complete
the proof of Theorem 12. First we note from (63) that
$\widehat{J}_{e,e'} = \widehat{J}_{e,e'}(\widehat{P}_{e,e'})$, i.e., $\widehat{J}_{e,e'}$ is a function of the empirical
distribution on node pairs $e$ and $e'$. Next, we note that
$D(Q \,\|\, P_{e,e'})$ is a continuous function in $(Q, P_{e,e'})$. If $\widehat{P}_{e,e'}$
is fixed, the expression (63) is a minimization of $D(Q \,\|\, \widehat{P}_{e,e'})$
over the compact set$^{21}$ $\Lambda = \{Q \in \mathcal{P}(\mathcal{X}^4) : I(Q_{e'}) = I(Q_e)\}$,
hence Lemma 14 applies (with the identifications $f \equiv D$
and $\Lambda \equiv \mathcal{K}$), which implies that $\widehat{J}_{e,e'}$ is continuous in the
empirical distribution $\widehat{P}_{e,e'}$. Since the empirical distribution
$\widehat{P}_{e,e'}$ converges almost surely to $P_{e,e'}$ by strong typicality [26,
Sec. 11.2], $\widehat{J}_{e,e'}(\widehat{P}_{e,e'})$ also converges almost surely to $J_{e,e'}$,
by Lemma 15. ■